pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
fduwjj	dc8bb2636c	[c10d][doc] Add docs for ENV variables TORCH_NCCL_ASYNC_ERROR_HANDLING TORCH_NCCL_TRACE_CPP_STACK and TORCH_NCCL_COORD_CHECK_MILSEC (#132920 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132920 Approved by: https://github.com/fegin, https://github.com/wconstab	2024-08-09 21:08:20 +00:00
Edward Z. Yang	1f66487c69	[BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132770 Approved by: https://github.com/bdhirsh	2024-08-08 23:07:23 +00:00
PyTorch MergeBot	d1f73fd844	Revert "[BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770 )" This reverts commit `902c6f3a19`. Reverted https://github.com/pytorch/pytorch/pull/132770 on behalf of https://github.com/ezyang due to Removed API was recommitted ([comment](https://github.com/pytorch/pytorch/pull/132770#issuecomment-2275749689))	2024-08-08 12:54:34 +00:00
Edward Z. Yang	902c6f3a19	[BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132770 Approved by: https://github.com/bdhirsh ghstack dependencies: #132674, #132675, #132421, #132062, #132767, #132769	2024-08-08 12:03:25 +00:00
Edward Z. Yang	aec6332356	Only thunkify proxies in some situations (#132421 ) The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead. I annotated the PR with explanation of changes. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421 Approved by: https://github.com/Skylion007, https://github.com/zou3519 ghstack dependencies: #132674, #132675	2024-08-08 12:03:06 +00:00
Edward Z. Yang	361db32d47	Consolidate SymDispatchMode into ProxyTensorMode (#132674 ) Instead of having a separate context variable for SymDispatchMode, we now simply delegate to the current active proxy tensor mode when we need to trace a SymInt. We maintain a separate `__sym_dispatch__` magic method as the calling convention is different than `__torch_dispatch__`. Consolidating the modes in this ways means that we can consistently disable both of these modes in tandem simply by removing the mode from the proxy mode infra slot. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674 Approved by: https://github.com/zou3519, https://github.com/bdhirsh	2024-08-08 12:02:54 +00:00
daitian1995	aff48f7378	Autoselect default device in FSDP construction. (#127609 ) There are still some differences between CUDA and non-CUDA custom devices when construct FSDP because CUDA is selected as the default device. For example, when construct FSDP from CPU model and device_id is not passed, device_handle will choose CUDA as default device. This PR will autoselect the real device as the default device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127609 Approved by: https://github.com/awgu	2024-08-08 05:25:17 +00:00
PyTorch MergeBot	a9ff190867	Revert "Consolidate SymDispatchMode into ProxyTensorMode (#132674 )" This reverts commit `ffdf48e63b`. Reverted https://github.com/pytorch/pytorch/pull/132674 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132674#issuecomment-2274062785))	2024-08-07 18:25:33 +00:00
PyTorch MergeBot	780310fed7	Revert "Only thunkify proxies in some situations (#132421 )" This reverts commit `bb99008c9e`. Reverted https://github.com/pytorch/pytorch/pull/132421 on behalf of https://github.com/clee2000 due to I think this broke dynamo/test_subclasses.py::TestNestedTensor::test_in_graph_construction_from_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10283744685/job/28459340678) [HUD commit link](`bb99008c9e`). Test got added in `f50621989b` which is before your merge base ([comment](https://github.com/pytorch/pytorch/pull/132421#issuecomment-2273742960))	2024-08-07 15:29:54 +00:00
Edward Z. Yang	bb99008c9e	Only thunkify proxies in some situations (#132421 ) The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead. I annotated the PR with explanation of changes. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421 Approved by: https://github.com/Skylion007, https://github.com/zou3519 ghstack dependencies: #132674, #132675	2024-08-07 11:51:17 +00:00
Edward Z. Yang	ffdf48e63b	Consolidate SymDispatchMode into ProxyTensorMode (#132674 ) Instead of having a separate context variable for SymDispatchMode, we now simply delegate to the current active proxy tensor mode when we need to trace a SymInt. We maintain a separate `__sym_dispatch__` magic method as the calling convention is different than `__torch_dispatch__`. Consolidating the modes in this ways means that we can consistently disable both of these modes in tandem simply by removing the mode from the proxy mode infra slot. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674 Approved by: https://github.com/zou3519, https://github.com/bdhirsh	2024-08-06 17:03:17 +00:00
Wouter Devriendt	e8645fa2b9	[Doc] fix some typos (found by codespell and typos) (#132544 ) Applying doc fixes from PR https://github.com/pytorch/pytorch/pull/127267 - with CLA Pull Request resolved: https://github.com/pytorch/pytorch/pull/132544 Approved by: https://github.com/kit1980	2024-08-05 17:21:56 +00:00
Xuehai Pan	4226ed1585	[BE] Format uncategorized Python files with `ruff format` (#132576 ) Remove patterns ``, `test/`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576 Approved by: https://github.com/ezyang, https://github.com/Skylion007 ghstack dependencies: #132574	2024-08-04 17:13:31 +00:00
Syed Tousif Ahmed	7c89ec0f7c	Implements torch.cuda.MemPool() API (#131152 ) In this PR: - Pool id creation logic is refactored and moved to a MemPool class. `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change. - MemPool holds a pointer to a CUDAAllocator as proposed in https://github.com/pytorch/pytorch/issues/124807#issuecomment-2077506997. Tests are added to show usage with CUDAPluggableAllocator. - MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in CUDACachingAllocator to route allocations to a user provided allocator. See draft here: https://github.com/pytorch/pytorch/pull/125722/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/131152 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-08-01 01:29:30 +00:00
Luca Wehrstedt	f4f7aba75d	Expose function to probe whether PyTorch was built with FlashAttention (#131894 ) This is needed by downstream projects (e.g., xFormers) to determine whether they can count on FlashAttention in PyTorch or whether they need to build it themselves. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131894 Approved by: https://github.com/drisspg, https://github.com/eqy	2024-07-31 11:33:09 +00:00
ekamiti	9e473fd868	Make adding Buffers more like adding Parameters (#125971 ) Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible. Fixes #35735 Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971 Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos	2024-07-31 10:32:40 +00:00
Simon Mahns	dcb03106b7	[Land Internally] MTIA equivalent of torch.cuda.memory_stats (#132007 ) Summary: as title Test Plan: pytorch ci failing: https://github.com/pytorch/pytorch/issues/131962 Differential Revision: D60335413 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132007 Approved by: https://github.com/hanzlfs, https://github.com/egienvalue	2024-07-29 20:47:18 +00:00
PyTorch MergeBot	eb9409511e	Revert "support zb1p and zb2p algorithms (#130752 )" This reverts commit `8fe5b93667`. Reverted https://github.com/pytorch/pytorch/pull/130752 on behalf of https://github.com/atalman due to Broke Periodic CI: distributed/pipelining/test_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131472868/job/28014900187) [HUD commit link](`8fe5b93667`) ([comment](https://github.com/pytorch/pytorch/pull/130752#issuecomment-2255819078))	2024-07-29 12:40:00 +00:00
PyTorch MergeBot	e191b83462	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit `709ddf7a9d`. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))	2024-07-26 18:08:20 +00:00
PyTorch MergeBot	b343644f3a	Revert "MTIA equivalent of torch.cuda.memory_stats (#131673 )" This reverts commit `513ce5f69a`. Reverted https://github.com/pytorch/pytorch/pull/131673 on behalf of https://github.com/clee2000 due to linked internal diff has internal changes, not sure what happened here, but this shouldn't have been merged externally without also merging the internal diff ([comment](https://github.com/pytorch/pytorch/pull/131673#issuecomment-2251749644))	2024-07-26 00:54:37 +00:00
Mikayla Gawarecki	709ddf7a9d	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-25 22:23:38 +00:00
Simon Mahns	513ce5f69a	MTIA equivalent of torch.cuda.memory_stats (#131673 ) Summary: Adding MTIA equivalent of `torch.cuda.memory_stats` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131673 Approved by: https://github.com/egienvalue	2024-07-25 21:59:59 +00:00
Yanbo Liang	a34692c0a3	[Inductor] Added and_masks and or_masks utilities & make fully masked out rows 0 instead of nan (#131552 ) Combine #131073 and #131012 and fix doc building failures. Co-authored-by: chilli <chilli@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131552 Approved by: https://github.com/Chillee	2024-07-25 21:29:46 +00:00
Mikayla Gawarecki	c3d099ddd1	[BE][Easy] Add hooks to doc for Optimizer base class (#131628 ) Happened to notice this was missing from the base class (but is rendering for the other optimizers like Adam etc.) when I wanted to link the state_dict hooks for https://discuss.pytorch.org/t/global-not-per-param-optimizer-state/206769 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131628 Approved by: https://github.com/janeyx99	2024-07-25 15:07:08 +00:00
Jane Xu	9c4cf866c2	Adafactor forloop basic impl (#129905 ) #109581 At this point, the vanilla implementation (the default) is good. Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor Specifically, the impl in this PR, which attempts to replicate the paper, ``` optim = torch.optim.Adafactor([weight]) ``` is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor ``` optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False) ``` is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor ``` optim = keras.optimizers.Adafactor(learning_rate=0.01) ``` The three results respectively for the same randomly generated weights: ``` # ours tensor([[ 0.3807594, -0.3912092], [ 0.0762539, 0.5377805], [ 0.2459473, 0.4662207]]) # pytorch-optimizer tensor([[ 0.3807592, -0.3912172], [ 0.0762507, 0.5377818], [ 0.2459457, 0.4662213]]) # keras array([[ 0.38076326, -0.39121315], [ 0.0762547 , 0.5377859 ], [ 0.24594972, 0.46622536]], dtype=float32) ``` This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences: * keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step)` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01. * We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling. <details> Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing My script repro: ``` import torch from pytorch_optimizer import AdaFactor torch.set_printoptions(precision=7) weight = torch.tensor([[ 0.37697506, -0.39500135], [ 0.07246649, 0.53399765], [ 0.24216151, 0.46243715]], dtype=torch.float32) # bias = torch.tensor([0, 0], dtype=torch.float32) weight.grad = torch.tensor([[-0.5940447, -0.7743838], [-0.5940447, -0.7743838], [-0.5940447, -0.7743838]], dtype=torch.float32) # bias.grad = torch.tensor([-2.5027974, 1.5422692], dtype=torch.float32) weight_c = weight.clone() weight_c.grad = weight.grad.clone() optim = torch.optim.Adafactor([weight]) optim.step() print(weight) optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False) optim_c.step() print(weight_c) ``` <details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905 Approved by: https://github.com/albanD	2024-07-25 13:17:19 +00:00
Haoci Zhang	8fe5b93667	support zb1p and zb2p algorithms (#130752 ) Previously, we have proved that ZB2P is not truly zero bubble when num_local_stages exceed 4 and so only ZB1P was supported. We did a few tweaks to the ZB2P to really make it zero bubble. Algorithm and proof is attached. [zero_bubble.pdf](https://github.com/user-attachments/files/16238738/zero_bubble.pdf) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130752 Approved by: https://github.com/H-Huang	2024-07-24 17:58:46 +00:00
Jun Luo	abb313b466	[torch.mtia] Noop set_rng_state and get_rng_state APIs (#130873 ) Summary: As title Test Plan: CI tests Reviewed By: joebos Differential Revision: D59036602 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130873 Approved by: https://github.com/hanzlfs	2024-07-24 01:52:21 +00:00
Shangdi Yu	68c725a094	[custom ops] Add register_vmap for custom ops (#130589 ) Fixes #130284 Fixes #130653 - Add `torch.library.register_vmap` to custom ops - Add `register_vmap` for operators in ops in custom_op_db. - Make `torch.autograd.Function` support kwarg-only kwargs for vmap - test operators in op_db with `tests/test_vmap`. - change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589 Approved by: https://github.com/zou3519	2024-07-23 17:48:38 +00:00
PyTorch MergeBot	e4b5645f83	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit `5b5e0698a5`. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))	2024-07-23 17:19:34 +00:00
PyTorch MergeBot	b435d84261	Revert "[custom ops] Add register_vmap for custom ops (#130589 )" This reverts commit `074b420641`. Reverted https://github.com/pytorch/pytorch/pull/130589 on behalf of https://github.com/atalman due to Please fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/130589#issuecomment-2244092174))	2024-07-23 01:44:44 +00:00
Shangdi Yu	074b420641	[custom ops] Add register_vmap for custom ops (#130589 ) Fixes #130284 Fixes #130653 - Add `torch.library.register_vmap` to custom ops - Add `register_vmap` for operators in ops in custom_op_db. - Make `torch.autograd.Function` support kwarg-only kwargs for vmap - test operators in op_db with `tests/test_vmap`. - change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589 Approved by: https://github.com/zou3519	2024-07-23 00:54:52 +00:00
Mikayla Gawarecki	5b5e0698a5	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-22 14:51:24 +00:00
PyTorch MergeBot	26383a6cc0	Revert "Added and_masks and or_masks utilities (#131073 )" This reverts commit `92bb323d36`. Reverted https://github.com/pytorch/pytorch/pull/131073 on behalf of https://github.com/albanD due to The docs build fails here and in trunk ([comment](https://github.com/pytorch/pytorch/pull/131073#issuecomment-2242997958))	2024-07-22 13:44:55 +00:00
chilli	92bb323d36	Added and_masks and or_masks utilities (#131073 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131073 Approved by: https://github.com/drisspg ghstack dependencies: #130871, #130904	2024-07-22 11:48:03 +00:00
Soumith Chintala	8e478d4fb1	Add Alban and Piotr into Core Maintainers (#130903 ) See official announcement here: https://dev-discuss.pytorch.org/t/alban-desmaison-and-piotr-bialecki-are-now-pytorch-core-maintainers/2280 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130903 Approved by: https://github.com/albanD, https://github.com/Skylion007	2024-07-20 16:02:42 +00:00
Li-Huai (Allan) Lin	125be005eb	[Docs] Fix fake tensor doc (#131205 ) Fix this: `# AttributeError: 'FakeTensorMode' object has no attribute 'from_real_tensor'` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131205 Approved by: https://github.com/eellison	2024-07-19 17:59:45 +00:00
Jerry Zhang	793b17ebcb	Add numeric_debugger top level APIs (#130643 ) Summary: Add three top level APIs for numeric debugger in pt2e flow that can log intermediate output in the model and calculate summary for metric comparisons between nodes in two graphs * `prepare_for_propagation_comparison` * `extract_results_from_loggers` * `compare_results` Test Plan: python test/test_quantization.py -k test_prepare_for_propagation_comparison python test/test_quantization.py -k test_extract_results_from_loggers Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130643 Approved by: https://github.com/dulinriley, https://github.com/tarun292	2024-07-18 20:54:18 +00:00
redwrasse	63a0a65df9	Define 'zero-preserving unary functions' in docs (#130804 ) Make explicit the definition of 'zero-preserving unary functions' in the sparse tensors documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130804 Approved by: https://github.com/soulitzer	2024-07-18 13:30:29 +00:00
drisspg	2b43d339fe	Make FlexAttention API public (#130755 ) # Summary Makes the prototype API flex_attention public Pull Request resolved: https://github.com/pytorch/pytorch/pull/130755 Approved by: https://github.com/Chillee	2024-07-16 16:21:25 +00:00
Xuehai Pan	a3abfa5cb5	[BE][Easy][1/19] enforce style for empty lines in import segments (#129752 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129752 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-07-16 00:42:56 +00:00
Jerry Zhang	b893aa71ca	Rename generate_numeric_debug_handle to numeric_debugger (#130590 ) Summary: att Test Plan: CI Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130590 Approved by: https://github.com/dulinriley, https://github.com/tarun292	2024-07-15 22:42:27 +00:00
Yu, Guangye	7cd48df2da	Refine the logic of device construction when only device index is given (#129119 ) # Motivation Before this PR, device construction was `cuda` type when only a device index was given. It also returns the `PrivateUser1` type if a `PrivateUser1` type is registered. ```bash >>> import torch >>> device = torch.device(0) >>> device.type 'cuda' >>> a = torch.tensor([1, 2]) >>> b = a.to(0) >>> b tensor([1, 2], device='cuda:0') ``` It works well on CUDA GPU. But it will raise unexpected information and error running on XPU. ```bash >>> import torch >>> device = torch.device(0) >>> device.type 'cuda' >>> a = torch.tensor([1, 2]) >>> b = a.to(0) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/xxx/pytorch/torch/cuda/__init__.py", line 302, in _lazy_init raise AssertionError("Torch not compiled with CUDA enabled") AssertionError: Torch not compiled with CUDA enabled ``` With this PR, refine the logic to use the currently available device type instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129119 Approved by: https://github.com/albanD, https://github.com/gujinghui, https://github.com/EikanWang ghstack dependencies: #129463, #129205, #129363	2024-07-15 14:34:29 +00:00
Yu, Guangye	9cae2160f5	Introduce the concept of Accelerators to PyTorch doc (#129363 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129363 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #129463, #129205	2024-07-15 14:24:46 +00:00
Mikayla Gawarecki	7c289c2a5c	Add torch.serialization.safe_globals context manager (#127939 ) Add context manager mentioned in https://github.com/pytorch/pytorch/pull/127808#pullrequestreview-2096298486 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127939 Approved by: https://github.com/albanD	2024-07-12 20:38:43 +00:00
rzou	9c69684af8	[custom_ops] expose torch.library.register_torch_dispatch (#130261 ) This is the API for defining the interaction between a torch_dispatch class and a custom op. Taking API bikeshedding. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261 Approved by: https://github.com/albanD ghstack dependencies: #130064	2024-07-12 14:13:01 +00:00
Shangdi Yu	fb9bc6d74a	[custom op] add doc for CustomOpDef.set_kernel_enabled (#130406 ) <img width="1067" alt="Screenshot 2024-07-09 at 6 14 55 PM" src="https://github.com/pytorch/pytorch/assets/22356083/941751f8-8e12-43cb-8477-c739476e0096"> <img width="965" alt="Screenshot 2024-07-09 at 6 14 59 PM" src="https://github.com/pytorch/pytorch/assets/22356083/aa9be099-f26c-45a3-8a14-742a2bb7c28b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130406 Approved by: https://github.com/zou3519	2024-07-11 15:47:35 +00:00
Shangdi Yu	a4576dad34	[reland][custom ops] infer schema (#130079 ) Fixes #129617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079 Approved by: https://github.com/zou3519	2024-07-11 03:39:07 +00:00
PyTorch MergeBot	86bca69c5f	Revert "[custom_ops] expose torch.library.register_torch_dispatch (#130261 )" This reverts commit `bb9a73f767`. Reverted https://github.com/pytorch/pytorch/pull/130261 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130261#issuecomment-2221569707))	2024-07-10 21:43:28 +00:00
PyTorch MergeBot	e14a0f45ed	Revert "[reland][custom ops] infer schema (#130079 )" This reverts commit `bef085bdfa`. Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2221561483))	2024-07-10 21:40:16 +00:00
Shangdi Yu	bef085bdfa	[reland][custom ops] infer schema (#130079 ) Fixes #129617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079 Approved by: https://github.com/zou3519	2024-07-10 16:18:36 +00:00
rzou	bb9a73f767	[custom_ops] expose torch.library.register_torch_dispatch (#130261 ) This is the API for defining the interaction between a torch_dispatch class and a custom op. Taking API bikeshedding. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261 Approved by: https://github.com/albanD ghstack dependencies: #130064	2024-07-09 21:11:27 +00:00
Yuanhao Ji	312652c325	[RFC] Add support for device extension autoloading (#127074 ) Fixes #122468 - Load device extensions at the end of `torch/__init__.py` - Enabled by default, or you can disable it with `TORCH_DEVICE_BACKEND_AUTOLOAD=0` run test: ```python python test/run_test.py -i test_autoload_enable python test/run_test.py -i test_autoload_disable ``` doc: https://docs-preview.pytorch.org/pytorch/pytorch/127074/miscellaneous_environment_variables.html co-author: @jgong5 @bsochack @bkowalskiINTEL @jczaja @FFFrog @hipudding Co-authored-by: albanD <desmaison.alban@gmail.com> Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127074 Approved by: https://github.com/albanD, https://github.com/jgong5	2024-07-09 06:14:13 +00:00
PyTorch MergeBot	44a773c121	Revert "[custom ops] infer schema (#130079 )" This reverts commit `3fe324ffb6`. Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/huydhn due to The test_public_bindings failure looks legit `3fe324ffb6` ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2215420957))	2024-07-08 22:02:29 +00:00
Shangdi Yu	3fe324ffb6	[custom ops] infer schema (#130079 ) Fixes #129617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079 Approved by: https://github.com/zou3519	2024-07-08 20:46:23 +00:00
Kurt Mohler	e590168865	Enable sharing meta tensors between processes (#129520 ) Fixes #129436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129520 Approved by: https://github.com/ezyang	2024-07-04 20:29:48 +00:00
Li-Huai (Allan) Lin	42f3d7e948	[MPS] Add mps profiler env vars to docs (#129552 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129552 Approved by: https://github.com/malfet ghstack dependencies: #129451	2024-07-04 06:44:48 +00:00
Zhengxu Chen	042d764872	[export] Update example inputs format for DB. (#129982 ) Summary: To give user a simpler example code, we are getting rid of ExportArgs in favor of example_args and example_kwargs. Test Plan: CI Differential Revision: D59288920 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129982 Approved by: https://github.com/angelayi	2024-07-03 17:53:15 +00:00
Edward Z. Yang	29c68df600	Stop immediately specializing common constants 0/1 for plain int (#128327 ) Fixes https://github.com/pytorch/pytorch/issues/128319 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128327 Approved by: https://github.com/lezcano ghstack dependencies: #129983	2024-07-03 16:41:51 +00:00
Howard Huang	4eb449f7dc	[pipelining] add small logging section to docs (#129368 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129368 Approved by: https://github.com/wconstab	2024-07-02 18:19:28 +00:00
Haoci Zhang	1ad683033b	Implemented flexible PP schedule (#129597 ) Enabled some cases to work where num_microbatches % pp_size != 0. Using the flex_pp schedule, we will have num_rounds = max(1, n_microbatches // pp_group_size) and it works as long as n_microbatches % num_rounds is 0. As a few examples, support pp_group_size = 4, n_microbatches = 10. We will have num_rounds = 2 and n_microbatches % 2 is 0. pp_group_size = 4, n_microbatches = 3. We will have num_rounds = 1 and n_microbatches % 1 is 0. Moved over from PiPPy (https://github.com/pytorch/PiPPy/pull/1129) Tested using the config in (1), schedule looks like the following graph: ``` =========== ALL_RANK_ACTIONS =========== Rank 0 Rank 1 Rank 2 Rank 3 Step 00: F0_s0 None None None Step 01: F1_s0 F0_s1 None None Step 02: F2_s0 F1_s1 F0_s2 None Step 03: F3_s0 F2_s1 F1_s2 F0_s3 Step 04: F4_s0 F3_s1 F2_s2 F1_s3 Step 05: F0_s4 F4_s1 F3_s2 F2_s3 Step 06: F1_s4 F0_s5 F4_s2 F3_s3 Step 07: F2_s4 F1_s5 F0_s6 F4_s3 Step 08: F3_s4 F2_s5 F1_s6 F0_s7 Step 09: F4_s4 F3_s5 None B0_s7 Step 10: F5_s0 None F2_s6 F1_s7 Step 11: None None B0_s6 B1_s7 Step 12: None F4_s5 F3_s6 F2_s7 Step 13: None B0_s5 B1_s6 B2_s7 Step 14: F6_s0 F5_s1 F4_s6 F3_s7 Step 15: B0_s4 B1_s5 B2_s6 B3_s7 Step 16: F7_s0 F6_s1 F5_s2 F4_s7 Step 17: B1_s4 B2_s5 B3_s6 B4_s7 Step 18: F8_s0 F7_s1 F6_s2 F5_s3 Step 19: B2_s4 B3_s5 B4_s6 B0_s3 Step 20: F9_s0 F8_s1 F7_s2 F6_s3 Step 21: B3_s4 B4_s5 B0_s2 B1_s3 Step 22: F5_s4 F9_s1 F8_s2 F7_s3 Step 23: B4_s4 B0_s1 B1_s2 B2_s3 Step 24: F6_s4 F5_s5 F9_s2 F8_s3 Step 25: B0_s0 B1_s1 B2_s2 B3_s3 Step 26: F7_s4 F6_s5 F5_s6 F9_s3 Step 27: B1_s0 B2_s1 B3_s2 B4_s3 Step 28: F8_s4 F7_s5 F6_s6 F5_s7 Step 29: B2_s0 B3_s1 B4_s2 B5_s7 Step 30: F9_s4 F8_s5 F7_s6 F6_s7 Step 31: B3_s0 B4_s1 B5_s6 B6_s7 Step 32: None F9_s5 F8_s6 F7_s7 Step 33: B4_s0 B5_s5 B6_s6 B7_s7 Step 34: None None F9_s6 F8_s7 Step 35: B5_s4 B6_s5 B7_s6 B8_s7 Step 36: None None None F9_s7 Step 37: B6_s4 B7_s5 B8_s6 B9_s7 Step 38: None None None None Step 39: B7_s4 B8_s5 B9_s6 B5_s3 Step 40: None None None None Step 41: B8_s4 B9_s5 B5_s2 B6_s3 Step 42: None None None None Step 43: B9_s4 B5_s1 B6_s2 B7_s3 Step 44: None None None None Step 45: B5_s0 B6_s1 B7_s2 B8_s3 Step 46: None None None None Step 47: B6_s0 B7_s1 B8_s2 B9_s3 Step 48: None None None Step 49: B7_s0 B8_s1 B9_s2 Step 50: None None Step 51: B8_s0 B9_s1 Step 52: None ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129597 Approved by: https://github.com/H-Huang	2024-07-02 07:54:38 +00:00
eqy	f845a7a91a	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-30 19:22:16 +00:00
PyTorch MergeBot	3d96217891	Revert "[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 )" This reverts commit `9e1f3ecaa7`. Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is still failing with the same error ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2197801405))	2024-06-29 00:47:15 +00:00
PyTorch MergeBot	999eec8dea	Revert "[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 )" This reverts commit `b7e7a4cb01`. Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some test_transformer running on internal A100 and V100 ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196202003))	2024-06-28 06:03:54 +00:00
Xuehai Pan	9e1f3ecaa7	[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 ) Changes by apply order: 1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`. 2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`. 3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first. `.parent{...}.absolute()` -> `.absolute().parent{...}` 4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.) `.parent.parent.parent.parent` -> `.parents[3]` 5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~ ~`.parents[3]` -> `.parents[4 - 1]`~ 6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-06-28 00:35:15 +00:00
Li-Huai (Allan) Lin	84ad5452f6	[MPS] Fused SGD optimizer (#129350 ) ``` [-------------------------------------- Fused SGD --------------------------------------] \| Fused: True \| Fused: False 1 threads: ------------------------------------------------------------------------------ numel: 1024, num_tensors: 100, momentum: True \| 2 \| 15 numel: 1024, num_tensors: 100, momentum: False \| 2 \| 5 numel: 65536, num_tensors: 100, momentum: True \| 3 \| 16 numel: 65536, num_tensors: 100, momentum: False \| 2 \| 5 numel: 1048576, num_tensors: 100, momentum: True \| 11 \| 16 numel: 1048576, num_tensors: 100, momentum: False \| 8 \| 6 numel: 1024, num_tensors: 500, momentum: True \| 29 \| 70 numel: 1024, num_tensors: 500, momentum: False \| 20 \| 24 numel: 65536, num_tensors: 500, momentum: True \| 33 \| 76 numel: 65536, num_tensors: 500, momentum: False \| 22 \| 26 numel: 1048576, num_tensors: 500, momentum: True \| 70 \| 80 numel: 1048576, num_tensors: 500, momentum: False \| 43 \| 40 numel: 1024, num_tensors: 1000, momentum: True \| 108 \| 139 numel: 1024, num_tensors: 1000, momentum: False \| 72 \| 48 numel: 65536, num_tensors: 1000, momentum: True \| 116 \| 150 numel: 65536, num_tensors: 1000, momentum: False \| 77 \| 52 numel: 1048576, num_tensors: 1000, momentum: True \| 190 \| 170 numel: 1048576, num_tensors: 1000, momentum: False \| 120 \| 50 ``` ```python def profile_fused_sgd(): from torch.optim.sgd import sgd import torch.utils.benchmark as benchmark import itertools def profile(fn, params, grads, momentum_buffer_list, fused): fn( params, grads, momentum_buffer_list, momentum=True if len(momentum_buffer_list) > 0 else False, dampening=0.0, nesterov=False, foreach=False, fused=fused, lr=1e-3, weight_decay=.0, maximize=False, grad_scale=None, found_inf=None, ) torch.mps.synchronize() device = "mps" results = [] for num_tensors, numel, momentum in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False]): sublabel = f"numel: {numel}, num_tensors: {num_tensors}, momentum: {momentum}" print(sublabel) params, grads = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(2)] momentum_buffer_list = [torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] if momentum else [] fn = sgd for fused in [True, False]: t = benchmark.Timer( stmt='profile(fn, params, grads, momentum_buffer_list, fused)', label='Fused SGD', sub_label=sublabel, globals=locals(), description= f"Fused: {fused}", ).blocked_autorange(min_run_time=5) results.append(t) compare = benchmark.Compare(results) compare.trim_significant_figures() compare.colorize(rowwise=True) compare.print() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129350 Approved by: https://github.com/janeyx99 ghstack dependencies: #129006, #129008, #129007, #129105	2024-06-27 04:37:14 +00:00
PyTorch MergeBot	895316119d	Revert "[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 )" This reverts commit `0314c4c101`. Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes lots of internal build failures where they fail to find hipify module ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2192437052))	2024-06-26 19:03:57 +00:00
Shangdi Yu	cca85c96cd	[export] minor typo fix (#129543 ) Fixes a typo in torch.export doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129543 Approved by: https://github.com/angelayi	2024-06-26 18:35:31 +00:00
Eddie Yan	b7e7a4cb01	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-26 00:49:18 +00:00
Zhengxu Chen	e58ef5b65f	[export] Rewrite exportdb formatting. (#129260 ) Summary: It'll be easier to generate examples if the code doesn't depend on exportdb library. Test Plan: CI Differential Revision: D58886554 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129260 Approved by: https://github.com/tugsbayasgalan	2024-06-25 21:04:53 +00:00
Li-Huai (Allan) Lin	71ebe5121a	[MPS] Fast math env var (#129007 ) Allow users to decide whether they want to have fast math enabled via env var Pull Request resolved: https://github.com/pytorch/pytorch/pull/129007 Approved by: https://github.com/malfet ghstack dependencies: #129006, #129008	2024-06-25 13:52:07 +00:00
Xuehai Pan	0314c4c101	[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 ) Changes by apply order: 1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`. 2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`. 3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first. `.parent{...}.absolute()` -> `.absolute().parent{...}` 4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.) `.parent.parent.parent.parent` -> `.parents[3]` 5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~ ~`.parents[3]` -> `.parents[4 - 1]`~ 6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-06-25 08:28:38 +00:00
Will Constable	2f8b301c32	Clean up distributed/CONTRIBUTING.md (#128450 ) Click [here](`cf6c88af48/torch/distributed/CONTRIBUTING.md`) to see the rendered version of the file in this PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/128450 Approved by: https://github.com/wanchaol	2024-06-22 02:41:22 +00:00
rzou	311fadb1fb	[docs] Redirect custom ops landing page to the correct place (#129177 ) I'm moving it to pytorch/tutorials Pull Request resolved: https://github.com/pytorch/pytorch/pull/129177 Approved by: https://github.com/albanD	2024-06-21 13:31:32 +00:00
cyy	5c676bb8b3	Remove Caffe2 handling from onnx_unpack_quantized_weights (#129021 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129021 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-06-21 06:16:44 +00:00
Jing Xu	5fba5d83f0	add xpu for amp (#127276 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to AMP doc. Co-authored-by: Yu, Guangye <guangye.yu@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127276 Approved by: https://github.com/dvrogozh, https://github.com/albanD, https://github.com/malfet	2024-06-20 21:49:35 +00:00
Zhengxu Chen	65286883d4	[export] reland "experimental joint graph API." (#129081 ) Summary: previous diff got reverted despite CI was green. Test Plan: CI Differential Revision: D58790048 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129081 Approved by: https://github.com/tugsbayasgalan	2024-06-20 16:50:53 +00:00
Oguz Ulgen	54b0006cb2	Evaluate symexprs on load path of cache not write (#128997 ) When caching is enabled, an internal model fails with ``` assert_size_stride(bmm_9, (17, s0, 512), (54784, 512, 1)) AssertionError: expected size 17==17, stride 57344==54784 at dim=0 ``` looking at this model, the exact problem is when the cache is hit on the forward graph, the generated code for backward fails since the strides of the outputs of forward, passed to backward as inputs, are not what we expected. This PR changes the evaluation logic so that we defer evaluation of output stride exprs to load path as opposed to eagerly doing it on save path. I have not been able to come up with a unit test repro for this problem. Differential Revision: [D58796503](https://our.internmc.facebook.com/intern/diff/D58796503) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128997 Approved by: https://github.com/ezyang	2024-06-20 08:55:12 +00:00
Li-Huai (Allan) Lin	19f3abcde4	[Docs][MPS] Add mps environment variable table (#129008 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129008 Approved by: https://github.com/malfet ghstack dependencies: #129006	2024-06-20 03:30:35 +00:00
PyTorch MergeBot	df94d57c0a	Revert "[export] experimental joint graph API. (#128847 )" This reverts commit `0707811286`. Reverted https://github.com/pytorch/pytorch/pull/128847 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/128847#issuecomment-2179326891))	2024-06-19 19:04:36 +00:00
Zhengxu Chen	0707811286	[export] experimental joint graph API. (#128847 ) Summary: WARNING: This API is highly unstable and will be subject to change in the future. Add a protoype to "decompose" an ExportedProgram into a joint graph form, so that we can compute the gradients on this graph. Test Plan: buck test mode/opt caffe2/torch/fb/export:test_experimental Differential Revision: D55657917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128847 Approved by: https://github.com/tugsbayasgalan	2024-06-19 16:45:27 +00:00
Li-Huai (Allan) Lin	0fc603ece4	[optim] Fused implementation stability table (#129006 ) I'd like to discuss the criteria that we regard an implementation as stable. If there is no existing standard, my initial proposal would be a 6 month period after the commit to regard it as stable. As a result, now Adam and AdamW on CUDA would be considered as stable, while the rest are of beta. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129006 Approved by: https://github.com/malfet	2024-06-19 16:29:49 +00:00
soulitzer	1877b7896c	[checkpoint] Clean up selective activation checkpoint and make public (#125795 ) ### bc-breaking for existing users of the private API: - Existing policy functions must now change their return value to be [CheckpointPolicy](`c0b40ab42e/torch/utils/checkpoint.py (L1204-L1230)`) Enum instead of bool. - To restore previous behavior, return `PREFER_RECOMPUTE` instead of `False` and `{PREFER,MUST}_SAVE` instead of `True` depending whether you prefer the compiler to override your policy. - Policy function now accepts a `ctx` object instead of `mode` for its first argument. - To restore previous behavior, `mode = "recompute" if ctx.is_recompute else "forward"`. - Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `create_selective_checkpoint_contexts `. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint). Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit Memory considerations: - As with the existing SAC, cached values are cleared upon first use. - We error if the user wishes to backward a second time on a region forwarded with SAC enabled. In-place: - We use version counting to enforce that if any cached tensor has been mutated. In-place operations not mutating cached tensors are allowed. - `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user is cleverly also saves the output of the in-place) Randomness, views - Currently in this PR, we don't do anything special for randomness or views, the author of the policy function is expected to handle them properly. (Would it would be beneficial to error? - we either want to save all or recompute all random tensors) Tensor object preservation - ~We guarantee that if a tensor does not requires grad, and it is saved, then what you get out is the same tensor object.~ UPDATE: We guarantee that if a tensor is of non-differentiable dtype AND it is not a view, and it is saved, then what you get out is the same tensor object. This is a nice guarantee for nested tensors which care about the object identity of of the offsets tensor. Policy function - Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error. - The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3). - The number of times we call the policy_fn is something that should be documented as part of public API. We call the policy function for all ops except ~~detach~~ UPDATE : metadata ops listed in `torch.utils.checkpoint.SAC_IGNORED_OPS`) because these ops may be called a different number of times by AC itself between forward and recompute. - The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below). Tensors guaranteed to be the same tensor as-is - Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795 Approved by: https://github.com/Chillee, https://github.com/fmassa	2024-06-18 18:18:50 +00:00
Boyuan Feng	43998711a7	[CUDAGraph] add more docs for cudagraph trees (#127963 ) This PR adds more documentation for CUDAGraph Trees, including - Iteration Support - Input Mutation Support - Dynamic Shape Support - NCCL Support - Reasons for Skipping CUDAGraph Pull Request resolved: https://github.com/pytorch/pytorch/pull/127963 Approved by: https://github.com/eellison	2024-06-18 02:07:07 +00:00
ibartol	c6b180a316	Created docs (and example) for cudart function in torch.cuda (#128741 ) Fixes #127908 ## Description Created docs to document the torch.cuda.cudart function to solve the issue #127908. I tried to stick to the [guidelines to document a function](https://github.com/pytorch/pytorch/wiki/Docstring-Guidelines#documenting-a-function) but I was not sure if there is a consensus on how to handle the docs of a function that calls an internal function. So I went ahead and tried what the function will raise, etc. from the user endpoint and documented it (i.e. I am giving what actually _lazy_init() will raise). Updated PR from #128298 since I made quite a big mistake in my branch. I apologize for the newbie mistake. ### Summary of Changes - Added docs for torch.cuda.cudart - Added the cudart function in the autosummary of docs/source/cuda.rst ## Checklist - [X] The issue that is being fixed is referred in the description - [X] Only one issue is addressed in this pull request - [X] Labels from the issue that this PR is fixing are added to this pull request - [X] No unnecesary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128741 Approved by: https://github.com/msaroufim	2024-06-17 16:50:37 +00:00
BowenBao	ab13980424	[ONNX] Update 'person_of_interest.rst', 'CODEOWNERS' and 'merge_rules.yaml' (#126364 ) The following are all constrained under the ONNX exporter project scope. - `personal_of_interest.rst` - Moving folks no longer working on the project to emeritus. - Adding @justinchuby, @titaiwangms, @shubhambhokare1 and @xadupre, who have all made countless contributions to this project. - `CODEOWNERS` - Removing folks no longer working on the project. - Updating new owners who will now be notified with PRs related to the specific file paths. - `merge_rules.yaml` - Removing folks no longer working on the project. 🫡 Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126364 Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/albanD	2024-06-16 04:52:16 +00:00
Zheng, Zhaoqiong	a2d9c430b4	Adding a note for Getting Started with PyTorch on Intel GPUs (#127872 ) Adding a note for Getting Started with PyTorch on Intel GPUs Pull Request resolved: https://github.com/pytorch/pytorch/pull/127872 Approved by: https://github.com/svekars	2024-06-14 14:24:28 +00:00
PyTorch MergeBot	6895a5804c	Revert "[checkpoint] Clean up selective activation checkpoint and make public (#125795 )" This reverts commit `c472cec565`. Reverted https://github.com/pytorch/pytorch/pull/125795 on behalf of https://github.com/soulitzer due to breaking torchtitan CI ([comment](https://github.com/pytorch/pytorch/pull/125795#issuecomment-2167036157))	2024-06-14 01:14:59 +00:00
Jing Xu	8763d44bf1	add xpu to torch.compile (#127279 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to torch.compile doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127279 Approved by: https://github.com/dvrogozh, https://github.com/svekars	2024-06-13 21:15:09 +00:00
Jing Xu	7fe9ab9ccc	update amp example to device-agnostic (#127278 ) As support for Intel GPU has been upstreamed, this PR is to make the AMP example doc device-agnostic. Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127278 Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/svekars	2024-06-13 02:01:16 +00:00
soulitzer	c472cec565	[checkpoint] Clean up selective activation checkpoint and make public (#125795 ) Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit Memory considerations: - As with the existing SAC, cached values are cleared upon first use. - We error if the user wishes to backward a second time on a region forwarded with SAC enabled. In-place: - We use version counting to enforce that if any cached tensor has been mutated. In-place operations not mutating cached tensors are allowed. - `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user is cleverly also saves the output of the in-place) Randomness, views - Currently in this PR, we don't do anything special for randomness or views, the author of the policy function is expected to handle them properly. (Would it would be beneficial to error? - we either want to save all or recompute all random tensors) Tensor object preservation - We guarantee that if a tensor does not requires grad, and it is saved, then what you get out is the same tensor object. If the tensor does require grad, we must detach to avoid creating a reference cycle. This is a nice guarantee for nested tensors which care about the object identity of of the offsets tensor. Policy function - Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error. - The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3). - The number of times we call the policy_fn is something documented part of public API. We call the policy function for all ops except detach because detach is itself called a different number of times by AC between forward and recompute. - The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below). Tensors guaranteed to be the same tensor as-is - Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary. "bc-breaking" for existing users of the private API: - Existing policy functions must now change their return value to use the Enum. - Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `gen_selective_checkpoint_context_fn`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795 Approved by: https://github.com/Chillee, https://github.com/fmassa	2024-06-12 23:57:33 +00:00
PyTorch MergeBot	817ce6835b	Revert "[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 )" This reverts commit `4c971932e8`. Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2163690162))	2024-06-12 18:47:52 +00:00
Kulin Seth	8df56afc20	Add support in Python API for the recommended max working set size. (#128289 ) Adds ways for users to request recommended max size for Metal on Mac. It plumbs through https://developer.apple.com/documentation/metal/mtldevice/2369280-recommendedmaxworkingsetsize?language=objc Can be used like ``` max_memory = torch.mps.recommended_max_memory() print ("Recommended Max Memory : ", (max_memory/(102410241024)), "GB") ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128289 Approved by: https://github.com/malfet	2024-06-12 16:03:57 +00:00
Jing Xu	205410cb44	add xpu to torch.tensors (#127280 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to torch.tensors doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127280 Approved by: https://github.com/svekars	2024-06-11 18:13:01 +00:00
Ke Wen	fe39c07826	[pipelining][doc] Remove duplicated words (#128368 ) "for execution" is used in both step titles Pull Request resolved: https://github.com/pytorch/pytorch/pull/128368 Approved by: https://github.com/wconstab ghstack dependencies: #128361	2024-06-11 04:52:57 +00:00
Ke Wen	4077cdd589	[pipelining][doc] Update arg list of pipeline API (#128361 ) And document the use of `build_stage` API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128361 Approved by: https://github.com/wconstab	2024-06-11 02:55:17 +00:00
Jun Luo	f843ccbb1a	[MTIA] Add set_device support (#128040 ) Summary: Support set_device API in MTIA backend. Reviewed By: gnahzg Differential Revision: D58089498 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128040 Approved by: https://github.com/gnahzg	2024-06-10 23:42:52 +00:00
loganthomas	583a56d5a8	DOC: add docstring to construct_and_record_rdzv_event() (#128189 ) Fixes #127902 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128189 Approved by: https://github.com/kurman	2024-06-10 22:17:33 +00:00
Shuqiang Zhang	c7e2c9c37e	[c10d][doc] add a doc page for NCCL ENVs (#128235 ) Addressing issue: https://github.com/pytorch/pytorch/issues/128204 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128235 Approved by: https://github.com/wconstab	2024-06-09 16:08:38 +00:00
eqy	4c971932e8	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-09 06:53:34 +00:00
Ke Wen	613c7d270d	[pipelining] Format doc (#128279 ) - Should use two dots around `var` - Wrap lines - Add section cross ref Pull Request resolved: https://github.com/pytorch/pytorch/pull/128279 Approved by: https://github.com/H-Huang ghstack dependencies: #128273, #128278	2024-06-08 04:59:04 +00:00
Ke Wen	2e42671619	[pipelining] Rename to stage.py and schedules.py (#128278 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128278 Approved by: https://github.com/H-Huang ghstack dependencies: #128273	2024-06-08 04:42:35 +00:00
Ke Wen	0e3fe694d1	[pipelining] Restore a stage constructor for tracer path (#128273 ) In case user modified stage module out of place, such as mod = DDP(mod) mod = torch.compile(mod) They need a stage builder else than `pipe.build_stage()`. This PR provides an API to do so: ``` def build_stage( stage_module, stage_index, pipe.info(), ... ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128273 Approved by: https://github.com/wconstab	2024-06-08 04:42:35 +00:00
Will Constable	f9508b4c1f	[pipelining] Update Pipelining Docs (#128236 ) ---- - Bring PipelineStage/Schedule more front-and-center - provide details on how to manually construct PipelineStage - move tracer example and manual example below so the high-level flow (e2e) is closer to the top Pull Request resolved: https://github.com/pytorch/pytorch/pull/128236 Approved by: https://github.com/H-Huang ghstack dependencies: #128201, #128228	2024-06-08 02:03:46 +00:00
Ke Wen	ad96f991a5	[pipelining] Add pipe.build_stage() (#128240 ) Given `PipelineStage` name to manual side. Thus adding a method under `Pipe` to create PipelineStage. Moved `PipeInfo` to utils.py to avoid circular dependency between `_IR` and `PipelineStage`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128240 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2024-06-08 01:26:02 +00:00
Howard Huang	bef586111a	[pipelining] pipelining.rst updates (#128228 ) fix some nits and add `PipelineStage` (manual) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128228 Approved by: https://github.com/wconstab ghstack dependencies: #128201	2024-06-07 23:29:54 +00:00
Ke Wen	3090667cf9	[pipelining] pipeline() taking microbatch as example input (#128163 ) Changed the API of `pipeline()` to take microbatch instead of full batch as example args. Main purpose is to: - make this API more atomic; - decouple tracing frontend from runtime info like `num_chunks`. Side effects: - Creates opportunity for varying `num_chunks` of schedules with the same `pipe` object. - User has to create example microbatch input. - Chunk spec stuff are now all moved to runtime side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128163 Approved by: https://github.com/H-Huang	2024-06-07 15:51:53 +00:00
Howard Huang	543a870943	[pipelining] Rename ManualPipelineStage -> PipelineStage (#128157 ) Renaming ManualPipelineStage to remove the "Manual" part. I needed to replace the existing `PipelineStage` which takes in the `pipe` argument, so I have renamed that to `TracerPipelineStage`. @kwen2501 will remove this entirely in favor of adding a util to `Pipe` to just create the stage directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128157 Approved by: https://github.com/wconstab	2024-06-07 09:24:16 +00:00
chunyuan	7efaeb1494	[AOTI] docs: add suggestion to turn on freezing on CPU (#128010 ) With https://github.com/pytorch/pytorch/pull/124350 landed, it is now suggested in AOTI to turn on freezing on CPU to get better performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128010 Approved by: https://github.com/desertfire	2024-06-07 08:57:02 +00:00
Ke Wen	01601ebd41	Retire torch.distributed.pipeline (#127354 ) Actually retiring module after deprecation warning for a while. The new supported module is: torch.distributed.pipelining. Please migrate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354 Approved by: https://github.com/wconstab	2024-06-07 08:11:58 +00:00
Ke Wen	96806b1777	[pipelining][doc] Add frontend description and change tracer example (#128070 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128070 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2024-06-07 04:09:36 +00:00
Pian Pawakapan	50155e825b	[export] provide refine function for automatically accepting dynamic shapes suggested fixes (#127436 ) Summary: Part of the work helping export's automatic dynamic shapes / dynamic shapes refining based on suggested fixes. Introduces a util function refine_dynamic_shapes_from_suggested_fixes() that takes the error message from a ConstraintViolationError message containing suggested dynamic shapes fixes, along with the original dynamic shapes spec, and returns the new spec. Written so that the suggested fixes from export can be directly parsed and used. Example usage for the automatic dynamic shapes workflow: ``` # export, fail, parse & refine suggested fixes, re-export try: export(model, inps, dynamic_shapes=dynamic_shapes) except torch._dynamo.exc.UserError as exc: new_shapes = refine_dynamic_shapes_from_suggested_fixes(exc.msg, dynamic_shapes) export(model, inps, dynamic_shapes=new_shapes) ``` For examples of behavior, see the added test and docstring. Will take suggestions for renaming the function to something else 😅 Test Plan: test_export tests Differential Revision: D57409142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127436 Approved by: https://github.com/avikchaudhuri	2024-06-07 03:29:06 +00:00
brightonanc	6dfdce92ba	Fixed typos in the complex numbers portion of the autograd docs (#127948 ) This PR fixes several typos in the complex numbers section of the docs for autograd. Only documentation was altered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127948 Approved by: https://github.com/soulitzer	2024-06-06 22:47:04 +00:00
ibartol	bb2de3b101	Fixed broken link and removed unfinished sentence from issue #126367 (#127938 ) Fixes #126367. ## Description Fixed a broken link in the pytorch/docs/source/torch.compiler_faq.rst doc and deleted a few words that were extra according to the issue tagged above. ## Checklist - [X] The issue that is being fixed is referred in the description - [X] Only one issue is addressed in this pull request - [X] Labels from the issue that this PR is fixing are added to this pull request - [X] No unnecesary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/127938 Approved by: https://github.com/msaroufim	2024-06-05 07:37:32 +00:00
Svetlana Karslioglu	20f966a8e0	Ignore undocumented PipelineSchedule.step (#127955 ) Ignore undocumented PipelineSchedule.step to fix doc build: https://github.com/pytorch/pytorch/actions/runs/9372492435/job/25805861083?pr=127938#step:11:1284 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127955 Approved by: https://github.com/kit1980	2024-06-04 22:11:09 +00:00
Tristan Rice	597922ba21	Reapply "distributed debug handlers (#126601 )" (#127805 ) This reverts commit `7646825c3e`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805 Approved by: https://github.com/PaliC	2024-06-04 19:44:30 +00:00
PyTorch MergeBot	0ff60236ab	Revert "Retire torch.distributed.pipeline (#127354 )" This reverts commit `b9c058c203`. Reverted https://github.com/pytorch/pytorch/pull/127354 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the doc build failure looks legit `b9c058c203` ([comment](https://github.com/pytorch/pytorch/pull/127354#issuecomment-2148133982))	2024-06-04 18:19:31 +00:00
Ke Wen	b9c058c203	Retire torch.distributed.pipeline (#127354 ) Actually retiring module after deprecation warning for a while. The new supported module is: torch.distributed.pipelining. Please migrate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354 Approved by: https://github.com/wconstab	2024-06-04 07:03:26 +00:00
Jeff Daily	0e7bd7fedd	[ROCm] TunableOp improvements (#124362 ) - use less memory; smaller default hipblaslt workspace size - options to avoid cache effects - icache flush option - rotating buffers during tuning - python APIs - unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124362 Approved by: https://github.com/xw285cornell	2024-06-03 22:30:11 +00:00
Sheng Fu	c1dd3a615f	Implement Graph Transform Observer (#127427 ) Summary: Implement Graph Transform Observer Differential Revision: D57887518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127427 Approved by: https://github.com/angelayi	2024-06-02 06:49:47 +00:00
PyTorch MergeBot	7646825c3e	Revert "distributed debug handlers (#126601 )" This reverts commit `3d541835d5`. Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987))	2024-05-31 01:21:24 +00:00
Alex Baden	5d316c81be	[Inductor] Add 0 initialization to Triton masked loads (#127311 ) For a masked `tl.load` operation, the Triton language specifies that values masked out (i.e. where the mask evaluates to false) are undefined in the output of the load. Triton provides an optional `other` parameter which, when included, provides an explicit value to use for masked out values from the load. If the output from a masked load without the `other` parameter is used in a conditional, unexpected behavior can occur. Despite the language specification, all Triton backends currently in use by PyTorch Inductor (NVIDIA, AMD, and Intel) 0-initialize masked loads if `other` is not present (we recently changed the Intel backend behavior to match NVIDIA and AMD because that's what our users expect, even if we are not following the Triton spec to the tee). This PR attempts to "future-proof" Inductor for new backends (or perhaps changes in the current backends? - we did not see any performance change from 0-initializing in the Intel XPU backend but one could imagine compiler optimizations to remove paths that depend on undefined) to add an explicit `other` in instances where later conditionals depend on the `tl.load` output. I also removed an exception to `other` behavior for boolean loads, which was put in place for a Triton bug that should be fixed. I added `other` to the getting started documentation as a clue that masked load behavior requires explicit initialization if, even though I don't expect `undef` values to cause the example code to fail if the underlying output is not 0-initialized. Finally, I added other to the `make_load` function in `select_algorithm.py`, though I wasn't able to determine if that function was actually being called. Fixes #126535 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127311 Approved by: https://github.com/jansel	2024-05-30 04:50:54 +00:00
Tristan Rice	3d541835d5	distributed debug handlers (#126601 ) This adds debug handlers as described in: * https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy) * https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy) This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR. This adds 2 handlers out of the box: * `/handler/ping` for testing purposes * `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder Test plan: ``` python test/distributed/elastic/test_control_plane.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601 Approved by: https://github.com/kurman, https://github.com/c-p-i-o	2024-05-30 02:21:08 +00:00
rzou	1abcac9dab	New Custom Ops Documentation landing page (#127400 ) We create a new landing page for PyTorch custom ops (suggested by jansel). All of our error messages will link here, and I'll work with the docs team to see if we can boost SEO for this page. NB: the landing page links some non-searchable webpages. Two of those (the Python custom ops tutorial and C++ custom ops tutorial) will turn into actual webpages when PyTorch 2.4 comes around. I'll make the third one (the Custom Operators Manual) once it stabilizes (we continously add new things to it and the length means that we might want to create a custom website for it to make the presentation more ingestable). Test Plan: - view docs preview. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127400 Approved by: https://github.com/jansel ghstack dependencies: #127291, #127292	2024-05-30 01:06:04 +00:00
Edward Z. Yang	76fc58c160	Document the legacy constructor for Tensor (#122625 ) Fixes https://github.com/pytorch/pytorch/issues/122408 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122625 Approved by: https://github.com/albanD	2024-05-29 23:23:19 +00:00
Xuehai Pan	26f4f10ac8	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980	2024-05-27 14:49:57 +00:00
PyTorch MergeBot	55c0ab2887	Revert "[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 )" This reverts commit `7763c83af6`. Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))	2024-05-27 09:22:08 +00:00
Xuehai Pan	7763c83af6	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980 ghstack dependencies: #127122, #127123, #127124, #127125	2024-05-27 04:22:18 +00:00
Xuehai Pan	35ea5c6b22	[3/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torchgen (#127124 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127124 Approved by: https://github.com/Skylion007 ghstack dependencies: #127122, #127123	2024-05-25 19:20:03 +00:00
Yu, Guangye	e7a42702f9	generalize custom_fwd&custom_bwd to be device-agnostic (#126531 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126531 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #126527	2024-05-25 06:48:16 +00:00
Yu, Guangye	c09205a057	Deprecate device-specific GradScaler autocast API (#126527 ) # Motivation ## for `torch.amp.GradScaler`, - `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch. amp.GradScaler("cpu", args...)`. - `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`. So, we intend to depreate them and strongly recommend developer to use `torch.amp.GradScaler`. ## for `custom_fwd` and `custom_bwd`, this is a good solution to make the custom function run with or without effect even in an autocast-enabled region and can be shared by other backends, like CPU and XPU. So we generalize it to be device-agnostic and put them int `torch/amp/autocast_mode.py` and re-expose to `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`. # Additional Context Add UT to cover the deprecated warning. No need for more UTs to cover the functionality of `torch.amp.custom_f/bwd`, the existing UTs that previously covered the functionality of `torch.cuda.amp.custom_f/bwd` can cover them. To facilitate the review, we separate these code changes to two PRs. The first PR cover `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang	2024-05-25 06:41:34 +00:00
lezcano	a30baec0c3	[Docs] Fix NumPy + backward example (#126872 ) We were calling backward on a tensor not a scalar... Pull Request resolved: https://github.com/pytorch/pytorch/pull/126872 Approved by: https://github.com/albanD	2024-05-22 21:29:31 +00:00
Kurman Karabukaev	d62b025efc	[TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743 ) Summary: 1. Define explicit `use_agent_store` on rdzv handlers. Handlers that set is true can share the store. 2. Instead of agent coordinating master_add/master_port values, the logic is now encapsulated by a rdzv_handler where `RendezvousInfo` will have `RendezvousStoreInfo` object that handlers must return. - Depending on the implementation they can either: - point to existing store (and expected to `use_agent_store` as true - point 1). Client code will rely on `TORCHELASTIC_USE_AGENT_STORE` env variable to know if the store is shared. - build args that `torch.distributed.init_process_group` can bootstrap by creating new store. Additional points: - When TCPStore is shared, it should be wrapped in PrefixStore to qualify/scope namespace for other usecases. - `next_rendezvous` signature changed to return instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple for extensibility purposes. Why: - Reduce moving parts - easier to swap implementation - improve tractability - addressing perf/debug-ability will benefit all usecases - Test Plan: CI Differential Revision: D57055235 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743 Approved by: https://github.com/d4l3k	2024-05-22 18:24:11 +00:00
Ke Wen	403012b50a	[pipelining] expose APIs per pytorch rule (#126812 ) Rule is enforced by #126103. The rule: - If `torch.a.b` defines a public class `C` (i.e. to be exposed in torch API namespace), then `torch.a.b` must be a public path, i.e. no `_`. - `torch.a.b` should ideally have an `__all__` that defines what should be imported from this file when it is imported. - All other definitions in `torch.a.b` that you don't want to expose should have a `_` prefix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126812 Approved by: https://github.com/wconstab	2024-05-22 16:21:13 +00:00
Sahdev Zala	fe0a36fd7c	Fix a link in the compiler backend doc (#126079 ) The core aten is the core subset of aten and seems the corrent link to replace the broken link. Fixes #125961 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126079 Approved by: https://github.com/svekars	2024-05-21 20:16:04 +00:00
Joel Schlosser	31ba6ee49b	Traceable wrapper subclass support for deferred runtime asserts (#126198 ) The padded dense -> jagged conversion op has the signature: ``` _fbgemm_dense_to_jagged_forward(Tensor dense, Tensor[] offsets, SymInt? total_L=None) -> Tensor ``` when `total_L` is not specified, the meta registration has a data-dependent output shape (based on `offsets[0][-1]`). Returning an unbacked SymInt here should work in theory, but traceable wrapper subclass support is missing in later code to handle deferred runtime asserts. This PR fixes this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126198 Approved by: https://github.com/ezyang	2024-05-21 01:21:46 +00:00
Mikayla Gawarecki	66dc8fb7ff	Allow tensor subclasses and add `torch.serialization.add_safe_globals` that allows users to allowlist classes for `weights_only` load (#124331 ) #### Conditions for allowlisting tensor subclasses We allow tensor subclasses types that (1) Do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__` so we check that these are `None`) (2) Use the generic `tp_alloc` (3) Are in a module that has been imported by the user to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2` Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function as this method claims to have no code execution. The rationale for the 3 conditions above is as follows: The rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__` and the call to `as_subclass` as well as the call to `_set_obj_state` which calls `setattr`) `4e66aaa010/torch/_tensor.py (L57-L71)` `as_subclass` is implemented with a call to `THPVariable_NewWithVar` that will eventually call `tp_alloc` here `4e66aaa010/torch/csrc/autograd/python_variable.cpp (L2053)` The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc` Note that we do not call `tp_init` or `tp_new` (i.e. `cls.__init__` or `cls.__new__`) when unpickling* ### How do we check something is a tensor subclass/constraints around imports In order to check whether `bla` is a tensor subclass in the bytecode `GLOBAL module.name`, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We do not arbitrarily import modules but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys[modules], name), torch.Tensor)` This PR also allowlisted `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`) ### API for allow listing This PR also added `torch.serialization.{add/get/clear}_safe_globals` that enables user to allowlist globals they have deemed safe and manipulate this list (for example they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that this is safe). Next steps: - Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor` etc.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124331 Approved by: https://github.com/albanD	2024-05-17 17:56:57 +00:00
yuanx749	691af57fbc	Fix broken link of scikit-learn (#120972 ) The link is broken in https://pytorch.org/docs/main/community/design.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/120972 Approved by: https://github.com/Skylion007	2024-05-16 11:46:34 +00:00
Edward Z. Yang	44efeac24e	Beef up error message for pending assert failure (#126212 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126212 Approved by: https://github.com/Skylion007	2024-05-15 18:22:53 +00:00
Oguz Ulgen	79655a1321	Add force_disable_caches to the docs (#126184 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126184 Approved by: https://github.com/msaroufim	2024-05-15 07:16:08 +00:00
Ke Wen	07d6ab5aa2	[pipelining] Add pipeline schedules (#125975 ) 1. Add pipeline schedules: - GPipe - 1F1B - Interleaved 1F1B - LoopedBFS 2. Add basic forward and backward tests: test_schedule.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/125975 Approved by: https://github.com/wconstab ghstack dependencies: #125729	2024-05-11 21:17:53 +00:00
Will Constable	26b942c4fc	[C10D] Document destroy_process_group usage (#122358 ) This API was not documented. It has already been a source of confusion, but recently has become more urgent as improper destruction can lead to hangs due to ncclCommAbort's requirement of being called collectively. <img width="888" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/9e16342d-1108-4d7d-95c8-b8753661b8e9"> Fixes #48203 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122358 Approved by: https://github.com/shuqiangzhang	2024-05-09 16:51:31 +00:00
lezcano	acafabaa29	Rename TorchDynamo -> Dyanamo in the dynamo tutorial doc (#123431 ) Less verbose and it aligns it with the dynamo deepdive Pull Request resolved: https://github.com/pytorch/pytorch/pull/123431 Approved by: https://github.com/peterbell10	2024-05-07 05:07:00 +00:00
albanD	76a26a885d	Add module tracker (#125352 ) This does a few things that were originally a few PRs but I am on a new machine and don't have ghstack. If it is too problematic to review, I can re-split, just let me know. This does: - Cleanup context manager use in test_flop_counter - Remove need for mod argument in FlopCounterMode, warning about it - Re-implement a Module tracker from scratch using global forward Module use and multi_grad_hook (we cannot use global backward Module hook because they don't look for nested Tensor and they're custom Function based instead of multi_grad_hook). - Update FlopCouterMode to use the new ModuleTracker. All the existing test suite passes as-is (only changes there are new tests and refactoring mentioned above) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125352 Approved by: https://github.com/mikaylagawarecki	2024-05-04 18:33:35 +00:00
Ke Wen	5cd7c75bd9	[pipelining] Add tracing frontend (#125448 ) This PR allows user to transform a model into a pipeline representation with split stages, according to a split spec. ``` def pipeline( module: torch.nn.Module, num_chunks: int, example_args: Tuple[Any, ...], example_kwargs: Optional[Dict[str, Any]] = None, split_spec: Optional[Dict[str, SplitPoint]] = None, split_policy: Optional[Callable[[fx.GraphModule], fx.GraphModule]] = None, ) -> Pipe: ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125448 Approved by: https://github.com/H-Huang ghstack dependencies: #125273	2024-05-04 09:00:25 +00:00
Muralidhar Andoorveedu	b96b1e8cff	[Distributed] Add P2P versions of *object_list operations (#124379 ) This PR adds `send_object_list` and `recv_object_list` to `distributed_c10d.py`. This is extending functionality already present in PyTorch with `broadcast_object_list` that I noticed was missing and decided to upstream. With this change, sending and receiving arbitrary picklable python objects is possible. Relevant issue: https://github.com/pytorch/pytorch/issues/3473 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124379 Approved by: https://github.com/kwen2501, https://github.com/wconstab	2024-05-03 23:22:58 +00:00
Alexandre Ghelfi, PhD	d18a6f46d0	Adding Compare in torch.utils.benchmark documentation (#125009 ) `torch.utils.benchmark.Compare` is not directly exposed in torch.utils.benchmark documentation. I think this is a valuable resource to add since it can help people embracing the torch benchmark way of doing things, and help people building documentation towards it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125009 Approved by: https://github.com/mikaylagawarecki	2024-05-03 00:50:54 +00:00
Ke Wen	0199ce8d6c	[pipelining] Add microbatch split and merge utils (#125273 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125273 Approved by: https://github.com/H-Huang ghstack dependencies: #124776, #124875, #124958	2024-05-02 21:09:47 +00:00
Lucas Pasqualin	799f1460af	[DCP] Provides default AsyncStager (#124939 ) Differential Revision: [D56575987](https://our.internmc.facebook.com/intern/diff/D56575987/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124939 Approved by: https://github.com/fegin ghstack dependencies: #122965	2024-05-02 19:48:54 +00:00
Lucas Pasqualin	3741fb3680	[DCP] Introduce async staging extension points (#122965 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * #124944 * #124939 * __->__ #122965 Differential Revision: [D55493240](https://our.internmc.facebook.com/intern/diff/D55493240/) This PR is now ready for merge and is not an RFC Major choices are: -- the introduction of the AsyncStager protocol -- removed `executor` from param. -- leave async as a separate method (for now) This proposal seeks to add extension points to dcp.async_save, allowing users to: - Specify a specific staging method when calling async_save - Allow a vehicle for also making the staging method async, to allow for cases where we may want to overlap with the training loop (e.g., overlap d2h with and only synchronize at the optim.step) - Potentially specify the execution method for doing async_save in parallel. For example some users may prefer a subprocess over a thread to avoid GIL issues. A totally reasonable alternative to this entire proposal is to expect users who want this level of customization to write their own custom async save methods. Here's an example which addresses the issues mentioned in PR comments. ``` def custom_async_save(...): # this step accomplishes staging and includes the usual 'planning' calls (issue 1) buffered_writer = CpuBufferedWriter() # this is stateful, contains a copy of state_dict dcp.save(state_dict, storage_writer=buffered_writer) final_storage_writer = FileSystemWriter() mp.spawn( # issue2 is gone, do whatever you want here dcp.save, # or some custom sub-process method which calls dcp.save under the hood buffered_writer.state_dict, # lot's of way's to do this, not really the most important part checkpoint_id=checkpoint_id, storage_writer=storage_writer, planner=planner, process_group=process_group, # this actually wouldn't work, but again not the pt. ) # leaving out the rest of the details for managing your extra special subprocess. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122965 Approved by: https://github.com/daulet-askarov	2024-05-02 19:01:55 +00:00
Ke Wen	52142192d4	[pipelining] Add stage backward function (#124958 ) This is a helper function which: 1. computes the gradients for the stage inputs, and 2. accumulates gradients for the stage module's parameters. A unit test for this function is also added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124958 Approved by: https://github.com/wconstab ghstack dependencies: #124776, #124875	2024-05-01 07:56:58 +00:00
Mikayla Gawarecki	2480e8b8a1	Add MAP_SHARED option for torch.load(mmap=True) (#124889 ) Fixes #124528 Going over the options for our MapAllocator and what they do, I don't think any other of them need to be piped up to `torch.load` `4f29103749/aten/src/ATen/MapAllocator.h (L8-L16)` ~However, I wonder if this `MmapVisibility(Enum)` is a good way to represent "or-ing" together of `mmap` flags if we want to extend it in the future. I looked over the flags for [`mmap(2)`](https://man7.org/linux/man-pages/man2/mmap.2.html), and could not immediately see how most of them would be useful for `torch.load` (would maybe `MAP_LOCKED` (like `mlock`) or `MAP_HUGE` ever be worthwhile?)~ Using the flags provided by the python `mmap` library so that we can extend the allowed flags and pipe them down to the cpp `mmap` call if there is a need for other flags in the future Pull Request resolved: https://github.com/pytorch/pytorch/pull/124889 Approved by: https://github.com/albanD	2024-04-30 15:02:19 +00:00
Avik Chaudhuri	e7846447e0	dynamic shapes builder API (#124898 ) This PR introduces a new way of building `dynamic_shapes` for export. The idea is to build up a mapping from input tensors to the dynamic shapes that should be assigned to their corresponding fake tensors. This mapping is automatically converted to the current form of `dynamic_shapes`, which must exactly match the structure of inputs. We do this by using pytree utils. With the current `dynamic_shapes`, we had to be careful about user-defined classes that are registered with pytree, since such classes are not necessarily polymorphic containers; they may be fine containing tensors, but not dynamic shapes. Thus we had decided to allow input instances of such classes to be associated with dynamic shapes in flattened form. This decision needs to be mirrored in this PR as well. To make it easier to keep these code paths in sync, we refactor the current recursive procedure for associating inputs with dynamic shapes to use the same pytree utils. This needs minor fixes to a few tests where `dynamic_shapes` were not exactly matching the structure of inputs. Differential Revision: D56551992 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124898 Approved by: https://github.com/zhxchen17	2024-04-30 03:59:49 +00:00
Tristan Rice	dc4c75ba72	elastic/rendezvous: make barrier and rank assignment operations O(n) instead of O(n^2) (#124982 ) Summary: This makes barrier and rank operations linear instead of quadratic with the number of workers. This drastically improves performance for rendezvous when running with over 1000 hosts. This uses 2 approaches for different areas: * local rank assignment: each worker does 1 set and 1 get, local ranks are assigned on the rank 0 host in a O(n) operation which reduces total store operations to be linear with number of workers. * exit_barrier: use a counter and a final flag so each worker has to do max 1 set, 1 get and 1 add. At 4000 hosts we see torchelastic be able to run in as little as 10 seconds down from 373 seconds. Test Plan: This is testing using many small tests running on a remote cluster. {D56549942} ``` torchx run --scheduler mast -- --image=torchelastic_benchmark --j=4000x1 ``` Differential Revision: D56605193 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124982 Approved by: https://github.com/kiukchung, https://github.com/kurman	2024-04-27 02:21:44 +00:00
egienvalue	73744a2c00	torch.mtia module for MTIA device backend (#123612 ) MTIA device has its own Module in PyTorch now. torch.mtia has following APIs similar to other backends. The lazy_init is also supported. ``` __all__ = [ "init", "is_available", "synchronize", "device_count", "current_device", "current_stream", "default_stream", "set_stream", "stream", "device", ] ``` ------------ For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon. ``` def _accelerator_hooks_device_count() -> _int: ... def _accelerator_hooks_set_current_device(device_index: _int) -> None: ... def _accelerator_hooks_get_current_device() -> _int : ... def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ... def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ... ``` --------- Adding get_device_module API to retrieve device modules for different device types. ``` def get_device_module(device: Optional[Union[torch.device, str]] = None) ``` --------- Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612 Approved by: https://github.com/albanD ghstack dependencies: #123611	2024-04-26 16:17:54 +00:00
Yu, Guangye	19a83eacb5	add new API torch.amp.is_autocast_available (#124938 ) # Motivation expose `torch._is_autocast_available` to `torch.amp.is_autocast_available` as a public api. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124938 Approved by: https://github.com/albanD	2024-04-26 08:45:20 +00:00
PyTorch MergeBot	e04c7b19f4	Revert "torch.mtia module for MTIA device backend (#123612 )" This reverts commit `381653de63`. Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))	2024-04-25 16:06:46 +00:00
Edward Z. Yang	b4597fffce	Try to reuse old symbol name rather than new symbol name when renaming (#124782 ) Previously, unbacked SymInts would gradually get larger and larger as we kept rebinding them. Now, we do the replacement to preserve the old symbol. Actually doing this is a bit tricky. Here’s the order things happen when retracing data dependent: 1. Run fake tensor prop: allocate new unbacked SymInt 2. Run proxy tensor mode, calculate bindings and associate them with FX node 3. Run PropagateUnbackedSymInts, rename unbacked bindings to their old ones so they are consistent So the problem is when we calculate bindings in step (2), we don't know what the original names are yet, we only find out later at (3). But by the time (3) runs, we've already stuffed some new bindings in meta["unbacked_bindings"] and we don't know how to update them! To fix this, I introduce resolve_unbacked_bindings which post facto applies any of the renamings we discovered in (3). Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124782 Approved by: https://github.com/lezcano ghstack dependencies: #124310, #124314, #124316, #124394, #124739	2024-04-25 14:02:42 +00:00
Edward Z. Yang	13ab24f192	Reimplement unbacked symbol bindings in Inductor (#124394 ) This PR has a lot of "draw the rest of the fucking owl" energy. Here's how to break it down. 1. torch/_inductor/graph.py - We start by tightening unbacked symbol invariants. Specifically, as we lower FX nodes, we check whether or not every unbacked_binding recorded on the FX node meta, actually ends up getting bound (according to get_unbacked_symbol_defs) in all the buffers generated by the lowering. Hopefully this invariant is self evident. This leads to a lot of failures. 2. torch/_inductor/ir.py - Problem 1: There is softness in how Inductor computes defs of unbacked symbols in IR node. Previously, we tried to infer it by looking at the output sizes/strides/etc and see if new unbacked symbols popped up that we hadn't seen in the inputs. I don't know exactly what was buggy about the old code, but sometimes we would fail to notice an unbacked symbol had been bound, or rebind an unbacked symbol multiple times. Fortunately, thanks to the earlier PRs in our stack, we now have a nice list of unbacked symbol bindings from FX, so we now just store it directly on ExternKernel and use it directly to report defs. This has to be done twice: once for FallbackKernel (e.g., nonzero) and once for DynamicScalar (e.g., item) (see also torch/_inductor/lowering.py, torch/_inductor/codegen/wrapper.py and torch/_inductor/codegen/cpp_wrapper_cpu.py for the lowering and codegen changes for item) * process_kernel - Sidequest! It turns out that Inductor lowering can reallocate unbacked symbols. This happens specifically when we repropagate fake tensors through the operator in `process_kernel`. This repropagation process is necessary because Inductor may have changed the strides of input tensors, and it must now recompute the strides so that it can continue to appropriately plan the rest of the lowering process. This is fine: we just make sure we do the rebind unbacked + compute_unbacked_bindings dance we've been doing previously in the PR stack. But instead of putting unbacked_bindings on a new FX node, they go straight into our unbacked_bindings on the Inductor IR node. * codegen_unbacked_symbol_defs - Sidequest! FallbackKernel lowering is done in two steps. First, you emit the FallbackKernel buffer. Then, you emit MultiOutput buffers which actually give access to the individual outputs of FallbackKernel, which may have been multi-output. There is a design decision here: does the FallbackKernel bind the unbacked symbols, or the MultiOutput buffer? Historically, we put the binding on MultiOutput buffer, because it's more convenient: the FallbackKernel buffer is fake, in fact, it doesn't even get a name in C++ codegen. But it's kind of inconsistent with the keypath model that we've been tracking unbacked bindings with: if you have a multi-output node, you'd expect a keypath like `[0].size()[0]` representing the first output's first dimension size. That suggests that it's the FallbackKernel that should define the things. So that was my first implementation. Unfortunately, the C++ codegen is too cursed and I could not understand how to make it work in that case. So now we just unsoundly assume you cannot have multi-output data dependent output, and do the codegen in MultiOutput. There are some comments explaining exactly what we are improperly assuming. 3. _rename_unbacked_to in torch/fx/experimental/symbolic_shapes.py - Previously, when we renamed unbacked symbols, we clobbered any facts we previously knew about them. So for example, if we had a replacement `u0 -> s0` but then we renamed u0 to u1, we would now setup the replacement `u0 -> u1`, clobbering the old replacement. This apparently didn't matter in earlier PRs in the stack, but with Inductor now on the ball, there were some tests that indicated this was a problem. The solution is easy: if u0 had a preexisting replacement, reapply it to u1. However... * torch/_functorch/_aot_autograd/collect_metadata_analysis.py - When we run forward analysis, this triggers fake tensor repropagation and fresh allocations. Previously, we just cleared out the pending symbols when finished the analysis. But with the change above, this would also migrate replacements to the new symbols... which are now dead. So now we explicitly suppress generation of these symbols with `ignore_fresh_unbacked_symbols` so that no rebinding happens at all. * torch/_dynamo/eval_frame.py - same deal; I just searched for all sites we called clear() on pending 4. The last step is fixing the long tail of extra problems that show up, now that unbacked_bindings are load bearing into Inductor * torch/_dynamo/eval_frame.py - Some of the exports are making copies of nodes without repropagating fake tensors, so in this case, it is important to also copy the `unbacked_bindings` (apparently this didn't matter before without the Inductor changes) * torch/_export/pass_base.py - I discover that this is doing fake tensor repropagation via a test suite failure. Do the same playbook as AOTAutograd: PropagateUnbackedSymInts too! Actually, they also have implemented their own tracer as well, so do the same playbook as proxy_tensor: record unbacked_bindings on the newly traced nodes. UGH code duplication. * torch/_subclasses/fake_tensor.py, torch/_subclasses/fake_impls.py (with call site updates at torch/_functorch/_aot_autograd/traced_function_transforms.py and torch/fx/passes/fake_tensor_prop.py) - What's this new epoch thing? I noticed that sometimes I would be retracing, call nonzero() on a fake tensor, and not allocate a new unbacked symbol. This is actually bad, because if I don't get a new unbacked symbol, I don't know there's a binding site, and `unbacked_bindings` is now missing a binding. The reason for this is memoization: if I reuse the exact same fake tensor on my retrace, it will already have an unbacked symint memoized on it and we will short circuit allocation. Well, that's no good. So I associate the memos with a fake tensor epoch, and every time you start a new fake tensor propagation from scratch, you bump the epoch so that I clear all the memos. * torch/_inductor/scheduler.py - I notice in unit tests that V.current_node is not always set when we call process_kernel. So I save it into the IR node and restore it when we are running `get_estimated_runtime`. * torch/fx/experimental/symbolic_shapes.py - A few things * rebind_unbacked (re _tensor_version). Ordinarily, when you have an unbacked SymInt, you persistently hvae it all the way to the end of the program. `_tensor_version` violates this: this generates an unbacked SymInt (for reasons I don't quite understand?) and then gets rid of it later. This triggered an assert violation. I think this op is kind of misusing unbacked SymInt, but I didn't know how to refactor it, so it gets a special case. * rebind_unbacked (re Simplify SymBool binding). Ugh, SymBool, what a pain in the butt. I have an assert that you can only rebind unbacked symbol to another unbacked symbol. This assert fails when a boolean is involved, because the result of running keypath on the result is not `u1`, it's `sympy.Piecewise(... sympy.Eq(u1, 1) ...)`. This is actually just `u1`, but Sympy doesn't know it because it doesn't know that `u1` value range is `[0, 1]`. So we manually implement the simplification needed to get the assert to pass. * compute_unbacked_bindings (re This is pretty fragile). There is a really funny disaster involving memoization and Inductor process kernel. Ordinarily when I retrace, if there was a memo hit in the old trace, there will be a memo hit in the new trace. However, Inductor process kernel breaks this, because it recreates fake tensor inputs to the operator call from scratch (since they might have different strides), and obviously these tensor inputs don't have the memo from the old one. I tried a little bit to try to manually transplant the memo to the new fake tensor but it seemed hopeless, so I just let the fresh symbol ride, allocating a new unbacked symbol. However, in one of our tests, we rely on knowing that the first nonzero call is equal to the second (memoized) nonzero call. The equality test looked pretty easy to discharge, so I just went ahead and added a deferred runtime assert to this effect and it worked. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124394 Approved by: https://github.com/jansel ghstack dependencies: #124310, #124314, #124316	2024-04-25 02:08:59 +00:00
Edward Z. Yang	9692b954c6	FakeTensorProp works with unbacked bindings (#124310 ) This is a partial revert of https://github.com/pytorch/pytorch/pull/124059 Like in #124297, profiling has revealed that testing equality on every output is kind of expensive. So we only test equality when we know there is an unbacked binding. This is the same playbook as the previous PR, just on FakeTensorProp instead of PropagateUnbackedSymInts. Note that we also need to populate `unbacked_bindings` in proxy_tensor.py, since we're generating an entirely new graph in that case. We now have enough propagation that we're able to trigger a bug related to divisibility replacement. In https://github.com/pytorch/pytorch/pull/113165 we allowed to replace `u0` with `u1 * c` for some constant c, when we have determined that u0 is divisible by c. However, where does the binding for u1 come from? What we will have in practice is that there is some node that is supposed to have bound u1, but which actually is getting a `u1 * c` in its output. So, to get u1, we must divide out c. Fortunately, under the divisibility condition, this is always possible (but remember, we must test divisibility at runtime!) Because we have tightened up asserts, it is now an error to allocate unbacked SymInts and then fail to track them under unbacked_bindings. In torch/_dynamo/eval_frame.py and torch/_functorch/_aot_autograd/collect_metadata_analysis.py there are examples of benign cases where we repropagated fake tensors but then immediately threw away the results. In these cases, it's not appropriate to rebind, since we're still using the old FX graph that has all of the old symbols. So we just manually clear it. It is possible that other cases will need to be updated, so this PR is "risky" from the perspective of hitting fbcode. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124310 Approved by: https://github.com/lezcano	2024-04-25 02:08:51 +00:00
Gagan Jain	c5e567c573	[Torch][Timer] Adding debug info logging interface for expired timers (#123883 ) Summary: Adding function to log additional debug information before killing the expired watchdog timers. Additional information like stack trace can be added in the debug function using worker process IDs from expired timers. Test Plan: buck test mode/opt caffe2/test/distributed/elastic/timer:file_based_timer_test Differential Revision: D56044153 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123883 Approved by: https://github.com/kurman	2024-04-25 01:15:52 +00:00
egienvalue	381653de63	torch.mtia module for MTIA device backend (#123612 ) MTIA device has its own Module in PyTorch now. torch.mtia has following APIs similar to other backends. The lazy_init is also supported. ``` __all__ = [ "init", "is_available", "synchronize", "device_count", "current_device", "current_stream", "default_stream", "set_stream", "stream", "device", ] ``` ------------ For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon. ``` def _accelerator_hooks_device_count() -> _int: ... def _accelerator_hooks_set_current_device(device_index: _int) -> None: ... def _accelerator_hooks_get_current_device() -> _int : ... def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ... def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ... ``` --------- Adding get_device_module API to retrieve device modules for different device types. ``` def get_device_module(device: Optional[Union[torch.device, str]] = None) ``` --------- Differential Revision: [D56443356](https://our.internmc.facebook.com/intern/diff/D56443356) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612 Approved by: https://github.com/albanD ghstack dependencies: #123611	2024-04-24 20:51:20 +00:00
Edward Z. Yang	e0e2d897ed	Handle Tensor returns in PropagateUnbackedSymInts (#124297 ) This subsumes https://github.com/pytorch/pytorch/pull/124069 In the original PR, my idea was that when we run PropagateUnbackedSymInts, we check that the sizes before and after are exactly the same. This ended up turning up lots of bugs that I didn't feel like fixing. Separately, Ivan let me know that this pass was quite expensive in terms of compile time, since we spent a lot of time thinking about the equalities. To kill two birds with one stone, we now only check for equality precisely when an unbacked SymInt was bound (thanks to the previous PR in this stack, we now have this information). Specifically, we look to see if `meta["unbacked_bindings"]` is set on the old node, and if it is, we assert the old value is equal to the new value from the repropagation. Note that the pytree key is used to actually extract the new value from the example value, as it may be nested inside an, e.g., tensor size. We do something a bit naughty at the end: we use `defer_runtime_assert` to actually teach ShapeEnv about the equality. This is implementationally equivalent to what we used to do, but we're going to change this later soon. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124297 Approved by: https://github.com/lezcano ghstack dependencies: #124290	2024-04-24 12:18:33 +00:00
Edward Z. Yang	b04dca1502	Add pending_fresh_unbacked_symbols, populate unbacked_bindings for Dynamo (#124290 ) The important comment: ``` # Whenever we allocate a fresh unbacked Symbol, we add it to this # pending list. Unbacked symbol allocation can occur at unpredictable # points during meta tensor propagation, but at some point, the we # have to know what the binding site for an unbacked symbol is, and # this is computed when we actually place the node in the graph. The # important thing is that we always actually handle every unaccounted # for unbacked symbol, so this list helps us keep track of them and # then make sure they are all accounted for. # # We could potentially give rise to errors earlier by lexically # scoping when we do propagation, and only allowing unbacked symbols # to be allocated at this point in time. However this is inconvenient # to do in Dynamo, because fake tensor propagation is far from when we # analyze binding sites (set_example_value), so we do it in a more # mutatey way. # # NB: fresh unbacked symbols NEVER get substitutions applied to them, # they are binding sites! ``` The compute_unbacked_bindings is the other half of the equation: the thing that actually consumes the pending_fresh_unbacked_symbols and does something with them. Important comment: ``` After having run fake tensor propagation and producing example_value result, traverse example_value looking for freshly bound unbacked symbols and record their paths for later. It is an error if we have allocated an unbacked SymInt but it cannot be found in example_value. (NB: this means if you have a multi-output function, you must call this on the tuple of tensor output, you cannot wait!) ``` For example, if I return a tensor with size `[u0, u1]`, and u1 is a fresh unbacked SymInt, then I'll have `{u1: KeyPath(".size(1)")}`, telling me I can get u1 by running `size(1)` on the result of this node. u0 is not fresh (it probably flowed in as an argument), so I don't generate a binding for it. I eventually intend to propagate this information all the way to Inductor lowering, where extra metadata about unbacked symbol binding will be canonically used for codegen, instead of trying to infer it from defs/uses. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124290 Approved by: https://github.com/lezcano	2024-04-24 09:11:34 +00:00
rzou	4ceb44c40d	Add torch.library.opcheck (#124496 ) This PR: - exposes torch.testing._internal.optests.opcheck as torch.library.opcheck - Adds support for CustomOpDef (aka functions decorated with torch.library.custom_op) to opcheck. Test Plan: - Updated tests - We validated opcheck's design internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124496 Approved by: https://github.com/williamwen42	2024-04-23 21:48:00 +00:00
Matthew Hoffman	1d3a13d3d1	Conform torch.mps to device module interface (#124676 ) Right now `torch.fork_rng()` doesn't support MPS. MPS' device module functions don't line up with the others'. There is a step of `fork_rng` to call `device_count()`: `302d7e9a6e/torch/random.py (L146)` It is pretty simple to know the MPS device count, based on whether it is built and available. Also: `302d7e9a6e/torch/random.py (L168)` `302d7e9a6e/torch/random.py (L175)` `get_rng_state` and `set_rng_state` are expected to be able to accept a `device` parameter. @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/124676 Approved by: https://github.com/ezyang	2024-04-23 18:38:48 +00:00
Jeff Daily	6ede882c0b	preferred blas library; cublaslt gemm implementation (#122106 ) Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources. The default blas implementation remains cublas or hipblas. cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106 Approved by: https://github.com/lezcano	2024-04-22 15:38:22 +00:00
PyTorch MergeBot	929242a15c	Revert "torch.mtia module for MTIA device backend (#123612 )" This reverts commit `d7e1bf9ff9`. Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))	2024-04-19 22:44:26 +00:00
rzou	bad8d25881	Add torch.library.register_kernel (#124299 ) This mirrors the .register_kernel method on the object produced by the custom_op decorator. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124299 Approved by: https://github.com/albanD ghstack dependencies: #124180, #124200	2024-04-19 13:54:21 +00:00
Tobias Ringwald	58e403c739	Added a docstring for torch.Size.numel. (#124186 ) Fixes #61231. Fixes #124167. This PR documents a rather long-standing issue w.r.t. unexpected behavior of `torch.Size.numel`, first reported almost 5 years ago. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124186 Approved by: https://github.com/janeyx99	2024-04-19 09:23:02 +00:00
egienvalue	d7e1bf9ff9	torch.mtia module for MTIA device backend (#123612 ) MTIA device has its own Module in PyTorch now. torch.mtia has following APIs similar to other backends. The lazy_init is also supported. ``` __all__ = [ "init", "is_available", "synchronize", "device_count", "current_device", "current_stream", "default_stream", "set_stream", "stream", "device", ] ``` ------------ For device management. We expand AccleratorHooksInterface to support generic device management and it can be used in both C++ and PyThon. ``` def _accelerator_hooks_device_count() -> _int: ... def _accelerator_hooks_set_current_device(device_index: _int) -> None: ... def _accelerator_hooks_get_current_device() -> _int : ... def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ... def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ... ``` --------- Adding get_device_module API to retrieve device modules for different device types. ``` def get_device_module(device: Optional[Union[torch.device, str]] = None) ``` --------- @exported-using-ghexport Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612 Approved by: https://github.com/albanD ghstack dependencies: #123611	2024-04-18 17:38:06 +00:00
Boyuan Feng	aa2da0cdd2	[Export] Add runtime assert to non-strict export (#123681 ) This PR moves insert_deferred_runtime_asserts from dynamo to torch.fx.passes and uses it to add runtime assertion for non-strict export. Differential Revision: D55944267 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123681 Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi	2024-04-18 16:13:27 +00:00
rzou	645173a0b5	Add torch.library.register_autograd (#124071 ) Allows registering autograd for all custom op entry points: - the new-style custom op API (custom_op) - the old-style torch.library APIs - C++ operator registration Test Plan: - tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124071 Approved by: https://github.com/albanD ghstack dependencies: #123937, #124064, #124065, #124066	2024-04-18 12:47:59 +00:00
doloresgarcia	4efdf9a6a6	fix pytorch version for onnx in doc (#124182 ) Fixes [ 123845](https://github.com/pytorch/pytorch/issues/123845) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124182 Approved by: https://github.com/albanD	2024-04-17 18:05:15 +00:00
rzou	47dbfecd37	Rename impl_abstract to register_fake, part 1/2 (#123937 ) This PR: - adds a new torch.library.register_fake and deprecates torch.library.impl_abstract. The motivation is that we have a lot of confusion around the naming so we are going to align the naming with the actual subsystem (FakeTensor). - renames `m.impl_abstract_pystub("fbgemm_gpu.sparse_ops")` to `m.has_python_registration("fbgemm_gpu.sparse_ops")`. No deprecation here yet; I need to test how this works with static initialization. - Renames a bunch of internals to match (e.g. abstractimplpystub -> pystub) I'm scared to rename the Python-side internal APIs (e.g. torch._library.abstract_impl) because of torch.package concerns. I'll do that in its own isolated PR next just in case it causes problems. DEPRECATION NOTE: torch.library.impl_abstract was renamed to to torch.library.register_fake. Please use register_fake. We'll delete impl_abstract in a future version of PyTorch. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123937 Approved by: https://github.com/albanD	2024-04-17 12:46:01 +00:00
Edward Z. Yang	cebf65126c	FakeTensorProp assert consistency of sizes when metadata previously existed (#124059 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124059 Approved by: https://github.com/bdhirsh, https://github.com/thiagocrepaldi ghstack dependencies: #124105	2024-04-16 23:28:42 +00:00
lezcano	891736f115	Fix links rendering when surrounding code in Dynamo deepdive (#123427 ) I thought the RST was rendering correctly, but here we are. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123427 Approved by: https://github.com/peterbell10	2024-04-13 04:55:15 +00:00
Gagan Jain	016ca546aa	Adding health check server hook in torch elastic (#122750 ) (#123504 ) Summary: Building hook for external mechanism to monitor the health of torch elastic launcher. Health check server takes dependency on FileTimerServer to check if launcher is healthy or not. It will be always healthy if FileTimerServer is disabled. Implementation of start_healthcheck_server is unsupported, however tcp/http server can be started on specific port which can monitor the aliveness of worker_watchdog and accordingly take the action. Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test Differential Revision: D55837899 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123504 Approved by: https://github.com/kurman	2024-04-11 19:10:56 +00:00
Edward Z. Yang	bbcdd28409	Report LRU cache stats at end of program for symbolic shapes (#123724 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123724 Approved by: https://github.com/Chillee	2024-04-11 05:12:43 +00:00
PyTorch MergeBot	ecb2418dd6	Revert "Adding health check server hook in torch elastic (#122750 )" This reverts commit `61d431fab0`. Reverted https://github.com/pytorch/pytorch/pull/122750 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/122750#issuecomment-2041104931))	2024-04-06 14:31:07 +00:00
Gagan Jain	61d431fab0	Adding health check server hook in torch elastic (#122750 ) Summary: Building hook for external mechanism to monitor the health of torch elastic launcher. Health check server takes dependency on FileTimerServer to check if launcher is healthy or not. It will be always healthy if FileTimerServer is disabled. Implementation of start_healthcheck_server is unsupported, however tcp/http server can be started on specific port which can monitor the aliveness of worker_watchdog and accordingly take the action. Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test Differential Revision: D55108182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122750 Approved by: https://github.com/kurman	2024-04-05 23:17:30 +00:00
Huy Do	f5b8c9b730	Ignore some known duplicated modules in doc build config script (#123425 ) This is a follow-up fix of https://github.com/pytorch/pytorch/pull/123244#discussion_r1552935150 as @clee2000 points out a better way to ignore those duplicated entries. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123425 Approved by: https://github.com/clee2000	2024-04-05 21:12:14 +00:00
Lucas Pasqualin	de7edeea25	[DCP] DCP logger (#121352 ) Adds additional logging for improved observability in DCP. Differential Revision: [D54512626](https://our.internmc.facebook.com/intern/diff/D54512626/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121352 Approved by: https://github.com/wz337, https://github.com/fegin	2024-04-05 17:50:50 +00:00
Guilherme Leobas	c575e378ba	Update torch.compile_faq w.r.t to functorch (#122213 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122213 Approved by: https://github.com/zou3519 ghstack dependencies: #122211, #122212	2024-04-05 03:29:11 +00:00
Guilherme Leobas	84658d9c4f	Enable `capture_func_transforms` by default (#122211 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122211 Approved by: https://github.com/zou3519	2024-04-05 03:29:11 +00:00
Huy Do	3d20cc1332	Cleanup some duplicated placeholder py:module docs (#123244 ) Fixes https://github.com/pytorch/pytorch/issues/123068 Fixes https://github.com/pytorch/pytorch/issues/111256 While investigating the flaky doc build failure .w.r.t duplicated `torch.ao.quantization.quantize` docstring warning, i.e. https://github.com/pytorch/pytorch/actions/runs/8532187126/job/23376591356#step:10:1260, I discover an old but still open bug in Sphinx https://github.com/sphinx-doc/sphinx/issues/4459. These warnings have always been there, but they are hidden because we are using `-j auto` to build docs with multiple threads. It's just by chance that they start to surface now. The issue can be reproduced by removing `-j auto` from https://github.com/pytorch/pytorch/blob/main/docs/Makefile#L5 and run `make html` locally. Then, these warnings shows up consistently. As `make html` treats warnings as errors, they will fail the build. ``` ... /data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/ao/quantization/quantize.py:docstring of torch.ao.quantization.quantize.quantize:1: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in quantization, use :noindex: for one of them /data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py:docstring of torch.nn.parallel.data_parallel.data_parallel:1: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in nn, use :noindex: for one of them /data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/spectral_norm.py:docstring of torch.nn.utils.spectral_norm.spectral_norm:1: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in nn, use :noindex: for one of them /data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:docstring of torch.nn.utils.weight_norm.weight_norm:1: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in nn, use :noindex: for one of them /data/users/huydo/github/pytorch/docs/source/nn.rst:579: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in generated/torch.nn.functional.torch.nn.parallel.data_parallel, use :noindex: for one of them /data/users/huydo/github/pytorch/docs/source/nn.rst:594: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in generated/torch.nn.utils.spectral_norm, use :noindex: for one of them /data/users/huydo/github/pytorch/docs/source/nn.rst:595: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in generated/torch.nn.utils.weight_norm, use :noindex: for one of them /data/users/huydo/github/pytorch/docs/source/quantization.rst:1348: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in generated/torch.ao.quantization.quantize, use :noindex: for one of them ... ``` The fix is just to clean up those duplicated placeholder py:module docs, which were there because these modules didn't have any docs originally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123244 Approved by: https://github.com/andrewor14, https://github.com/malfet	2024-04-05 03:18:53 +00:00
rzou	44c0c0fc0f	Add torch.library.custom_op (#122344 ) This is the entrypoint for defining an opaque/blackbox (e.g. PyTorch will never peek into it) custom op. In this PR, you can specify backend impls and the abstract impl for this op. NB: most of this PR is docstrings, please don't be intimidated by the line count. There are a number of interesting features: - we infer the schema from type hints. In a followup I add the ability to manually specify a schema. - name inference. The user needs to manually specify an op name for now. In a followup we add the ability to automatically infer a name (this is a little tricky). - custom_op registrations can override each other. This makes them more pleasant to work with in environments like colab. - we require that the outputs of the custom_op do not alias any inputs or each other. We enforce this via a runtime check, but can relax this into an opcheck test if it really matters in the future. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/122344 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-04-03 18:36:17 +00:00
lezcano	b27ee6548d	Add a Dynamo deepdive to documentation (#122305 ) This supersedes the previous `Guards Overview" as a more comprehensive approach to most of the main topics within Dynamo. In the future, we could add specific sections for each of the topics discussed here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122305 Approved by: https://github.com/msaroufim	2024-04-02 15:08:08 +00:00
Will Feng	489f4a063b	Revert "Preserve unbacked SymInt on SymNode (#120816 )" (#122988 ) This reverts commit `476585b190`. I did a bisect and this seems to be the cause of compile time regression in cudagraphs_dynamic test suite between 03/23 and 03/24: ![image](https://github.com/pytorch/pytorch/assets/4063635/21394e06-4906-4690-b5a2-7d16cc475843) image Particularly BERT_pytorch and hf_T5 seem to have ~50% compile time regression. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122988 Approved by: https://github.com/eellison	2024-04-01 22:11:09 +00:00
Mikayla Gawarecki	487b6d40ec	Add RMSNorm module (#121364 ) Similar to `dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L51)` The implementation here is not optimized and we welcome pull requests to improve this - Use `normalized_shape` instead of singular integer `dim` to be aligned with the `nn.LayerNorm` implementation - Remove the [upcast to float and downcast ](`dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L73)`) Differential Revision: [](https://our.internmc.facebook.com/intern/diff/) Differential Revision: [D55485840](https://our.internmc.facebook.com/intern/diff/D55485840) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121364 Approved by: https://github.com/albanD	2024-03-29 18:05:28 +00:00
David Berard	59f6393209	[docs] Update PT2+Profiler docs (#122272 ) Document: * Torch-Compiled Region * What to expect in kernels inside a torch-compiled region For review, see https://docs-preview.pytorch.org/pytorch/pytorch/122272/torch.compiler_profiling_torch_compile.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/122272 Approved by: https://github.com/aaronenyeshi	2024-03-28 17:52:28 +00:00
PyTorch MergeBot	8698121636	Revert "Add RMSNorm module (#121364 )" This reverts commit `a7306de0dc`. Reverted https://github.com/pytorch/pytorch/pull/121364 on behalf of https://github.com/atalman due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/121364#issuecomment-2025502007))	2024-03-28 15:31:10 +00:00
Aaron Orenstein	a8b7480f0d	fix dynamo.explain examples (#122745 ) `dynamo.explain()` was updated to return a structure but the docs weren't updated to match. - Update the docs to use the new API - Remove some dead code left when `explain` was updated. - Drive-by: Fix some `nopython` uses that I noticed - Drive-by: I noticed an ignored error coming from CleanupHook on shutdown - make it check the global before setting it. Fixes #122573 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122745 Approved by: https://github.com/jansel	2024-03-27 22:53:27 +00:00
Mikayla Gawarecki	a7306de0dc	Add RMSNorm module (#121364 ) Similar to `dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L51)` The implementation here is not optimized and we welcome pull requests to improve this - Use `normalized_shape` instead of singular integer `dim` to be aligned with the `nn.LayerNorm` implementation - Remove the [upcast to float and downcast ](`dbeed9724b/torchmultimodal/modules/layers/normalizations.py (L73)`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121364 Approved by: https://github.com/albanD	2024-03-27 21:39:30 +00:00
Frank Lin	249e65b92d	Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 ) See #113541 The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality. cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068 Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/xuzhao9	2024-03-27 01:14:38 +00:00
Edward Z. Yang	85845a29db	Refactor ShapeEnvSettings so it's directly on ShapeEnv (#122310 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122310 Approved by: https://github.com/masnesral, https://github.com/lezcano	2024-03-26 14:16:33 +00:00
PyTorch MergeBot	4dc09d6aa4	Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 )" This reverts commit `e9dcda5cba`. Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))	2024-03-25 13:49:04 +00:00
Edward Z. Yang	476585b190	Preserve unbacked SymInt on SymNode (#120816 ) Previously, when we applied a replacement, a SymInt that was previously an unbacked SymInt would then transmute into whatever we replaced it into (e.g., a constant). This has a major downside: we often look at SymInts associated with FX nodes (e.g., the meta of x.item() return) to find out where the unbacked SymInt was allocated. If we replace it, we no longer can find out where, e.g., u1 was allocated! But we need to know this so we can generate deferred runtime asserts like u1 == s0. To solve this problem, I have a special mode for replace, resolve_unbacked=False, which lets you disable substitutions on unbacked SymInts. When reporting node.expr, we preferentially avoid applying unbacked SymInt substitutions. To understand if we might accidentally reapply the substitution later, before we have reached the deferred runtime assert, we must study the calls to simplify() in ShapeEnv. My audit turns up these sites: * `produce_guards`: this is fine, deferred runtime asserts never show up here, we must NOT have unbacked SymInts show up here. Similarly `get_nontrivial_guards`. * `_maybe_evaluate_static`: this is fine, we are using this to determine if it is necessary to produce a guard/runtime assert. We don't want to reissue a runtime assert if we've already asserted on it, and replacements can help us understand if this has occurred. * `_simplify_floor_div`: this is a legitimate bug, it needs to be `resolve_unbacked=False` * `_refine_ranges`: this is fine, a refined range doesn't affect what runtime asserts we issue * `_update_divisible`: this updates the `self.divisible` set, which specifies when we can simplify away divisibility constraints. Since this affects replacements only, it won't cause us to oversimplify a user provided expression. There are some situations where we DO want to always apply the substitution, specifically when we have the duplicate symbol problem (we retrace an item call and get u0 and u1 which refer to the same thing.) I don't want two symbols in this case, so a special `rename_unbacked_to` is provided which sets up the unconditional renaming. Along the way, I make a refinement to `_update_var_to_range`: if you update a var range for a size-like unbacked SymInt, you are now no longer allowed to set its lower bound below 2. This is because if you could, then our size oblivious tests for it would be inconsistent. Actually, I think there is still some inconsistency, because if you assert `u0 == 0` we will still end up with this in deferred runtime asserts, and we will then use this to simplify these statements to be True everywhere else. Maybe we should forbid this kind of refinement; not done in this PR. Fixes https://github.com/pytorch/pytorch/issues/119689 Fixes https://github.com/pytorch/pytorch/issues/118385 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120816 Approved by: https://github.com/lezcano	2024-03-24 02:56:16 +00:00
liqunfu	bbe846f430	Add symbolic_opset19.py and symbolic_opset20.py to support opset 19/20, extend opset 18 support (#118828 ) Start to fix https://github.com/pytorch/pytorch/issues/114801 Co-authored-by: Thiago Crepaldi <thiagofc@microsoft.com> Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118828 Approved by: https://github.com/thiagocrepaldi	2024-03-22 18:01:33 +00:00
Sahdev Zala	17175cdbc7	[Docs] Add extended debugging options for troubleshooting (#122028 ) Fixes #120889 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122028 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-03-21 17:00:45 +00:00
Frank Lin	e9dcda5cba	Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 ) See #113541 The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality. cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068 Approved by: https://github.com/ezyang	2024-03-21 01:57:08 +00:00
Nathan	ae983d2d6e	Fix typo in sparse.rst (#121826 ) Change word "on" to "one" when talking in the third person. Fixes #121770 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121826 Approved by: https://github.com/janeyx99	2024-03-19 00:17:19 +00:00
Jane Xu	37e563276b	Document complex optimizer semantic behavior (#121667 ) <img width="817" alt="image" src="https://github.com/pytorch/pytorch/assets/31798555/565b389d-3e86-4767-9fcb-fe075b50aefe"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121667 Approved by: https://github.com/albanD	2024-03-16 00:43:47 +00:00
Tugsbayasgalan Manlaibaatar	53d2188df9	Update get_aten_graph_module (#121937 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121937 Approved by: https://github.com/andrewor14	2024-03-15 20:35:55 +00:00
Aidyn-A	af86d67d61	[Doc][NVTX] Add documentation for nvtx.range (#121699 ) The context manager `torch.cuda.nvtx.range` has been around for about 4 years (see #42925). Unfortunately, it was never documented and as a consequence users are just unaware of it (see #121663). Pull Request resolved: https://github.com/pytorch/pytorch/pull/121699 Approved by: https://github.com/janeyx99	2024-03-15 20:26:44 +00:00
lezcano	d0d09f5977	Fix torch.compile links (#121824 ) Fixes https://github.com/pytorch/pytorch.github.io/issues/1567 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121824 Approved by: https://github.com/svekars, https://github.com/peterbell10, https://github.com/malfet ghstack dependencies: #121823	2024-03-15 19:49:37 +00:00
Matthias Reso	a9274c9a2c	Fix aoti doc to avoid cannot bind non-const lvalue reference error (#121672 ) This PR corrects the example in the AOTInductor example which currently fails with: ``` /home/ubuntu/test/inference.cpp:21:62: error: cannot bind non-const lvalue reference of type ‘std::vector<at::Tensor>&’ to an rvalue of type ‘std::vector<at::Tensor>’ 21 \| std::cout << runner.run({torch::randn({2, 10}, at::kCPU)})[0] << std::endl; \| ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121672 Approved by: https://github.com/desertfire	2024-03-12 23:43:40 +00:00
PyTorch MergeBot	0398dc9e8e	Revert "[DCP] Makes fsspec public (#121508 )" This reverts commit `d482614fec`. Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))	2024-03-12 17:02:43 +00:00
Lucas Pasqualin	d482614fec	[DCP] Makes fsspec public (#121508 ) Fixes #118033 Also removes `_checkpointer.py` class original PR's: - https://github.com/pytorch/pytorch/pull/121330 - https://github.com/pytorch/pytorch/pull/121329 We're also disabling `test_fsdp` since it is failing on random PR's Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508 Approved by: https://github.com/fegin	2024-03-09 01:14:18 +00:00
chilli	ed8eebd1c2	Changed cublas repdocubility URL (#121534 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121534 Approved by: https://github.com/Skylion007	2024-03-08 23:46:21 +00:00
angelayi	f2d5e96db4	[export] Add docs for 2.3 release (#121466 ) - Added docs about non-strict export - Added example using derived dims - Added api docs for ep.run_decompositions() (https://github.com/pytorch/pytorch/issues/119480) - Tried to include/cover everything in https://docs.google.com/document/d/1kZ_BbB3JnoLbUZleDT6635dHs88ZVYId8jT-yTFgf3A/edit Pull Request resolved: https://github.com/pytorch/pytorch/pull/121466 Approved by: https://github.com/zhxchen17	2024-03-08 22:29:48 +00:00
Ke Wen	c78f72d7e7	[c10d] Deprecate torch.distributed.pipeline (#121464 ) In favor of PiPPy (Pipeline Parallelism for PyTorch) https://github.com/pytorch/PiPPy Pull Request resolved: https://github.com/pytorch/pytorch/pull/121464 Approved by: https://github.com/wz337, https://github.com/awgu	2024-03-08 19:55:02 +00:00
Wanchao Liang	30982ce072	[tp] doc fixes (#121431 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121431 Approved by: https://github.com/wz337	2024-03-08 17:46:44 +00:00
Karol Blaszczak	8ed0932172	Update link to OpenVINO backend in torch.compiler.rst (#121303 ) This is a permalink, so it will remain active regardless of documentation version changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121303 Approved by: https://github.com/soulitzer	2024-03-08 08:17:13 +00:00
Chien-Chin Huang	2e789ad522	[DCP][state_dict][doc] Update the distributed state_dict document (#121290 ) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/121290 Approved by: https://github.com/LucasLLC ghstack dependencies: #121273, #121276	2024-03-08 07:58:18 +00:00
Lucas Pasqualin	96ed37ac13	[DCP] Makes async_save public (#121325 ) Makes async_save public Differential Revision: [D54593610](https://our.internmc.facebook.com/intern/diff/D54593610/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121325 Approved by: https://github.com/wz337 ghstack dependencies: #121317	2024-03-08 05:13:13 +00:00
Cheng Ni	9bff1599b6	[Torch Elastic][Draft] Refactor SubprocessHandler to separate module for easier subclass (#120373 ) Summary: ## No Functional Change - Refactor Subprocess Handler into a separate folder for easier subclassing - SubprocessHandler - added `local_rank_id` in `SubprocessHandler` to make it available as a field in the class - pass in `local_rank_id` from subprocess start Test Plan: No functional changes. Differential Revision: D54038627 #suppress-api-compatibility-check Pull Request resolved: https://github.com/pytorch/pytorch/pull/120373 Approved by: https://github.com/kurman	2024-03-08 01:37:34 +00:00
Dheeraj Peri	b1657beac1	feat: Add min, max ranges to mark_dynamic API (#119737 ) Fixes https://github.com/pytorch/pytorch/issues/115137 This PR adds: - mark_dynamic API will accept `min`, `max` values to create a bounded constraint on the dim. - test case in test_misc.py which checks if `ConstraintViolationError` is triggered if `torch.compile` gets a input dimension out of bounds. Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/119737 Approved by: https://github.com/ezyang, https://github.com/jansel	2024-03-07 23:26:03 +00:00
suo	c3c15eb9a6	[export] update docs to not export raw functions (#121272 ) as title Differential Revision: [D54555101](https://our.internmc.facebook.com/intern/diff/D54555101/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121272 Approved by: https://github.com/zhxchen17	2024-03-07 17:18:07 +00:00
Wanchao Liang	1a28ebffb3	[TP] Introduce Sequence Parallel Style for Laynorm/RMSNorm/Dropout (#121295 ) As titled, this PR introduces a dedicated `ParallelStyle` to shard the nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using a manual distribute_module calls before when sharding the RMSNorm layer, but I think we should have a dedicate TP API to easily shard those layers, instead of user manually using DTensors. I call this SequenceParallel, which might bring some confusion that we technically "deprecated" a SequenceParallel style months ago. But this time the SeuqenceParallel style is significantly different with the previous ones (which used to shard two consecutive Linear layers). I believe making it the right name is the first priority, instead of worrying about the issue of reusing the old name Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295 Approved by: https://github.com/awgu, https://github.com/tianyu-l ghstack dependencies: #121294	2024-03-07 02:04:59 +00:00
Zhengxu Chen	8aeb247a3d	[export] Remove WrapperModule. (#121042 ) Summary: WrapperModule seems a good idea but may introduce some surprising behavior to users, for example, it never registers enclosed modules as submodules and therefore it's unclear that's the state dict for the exported program should look like, because some people may argue to include every state in state dict but others want to keep them as constants. Test Plan: CI Reviewed By: tugsbayasgalan Differential Revision: D54326331 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121042 Approved by: https://github.com/angelayi	2024-03-05 18:10:22 +00:00
Tianyu Liu	af5376c444	[dtensor] add support for loss parallel (#119877 ) Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is via a context manager `loss_parallel`, after enabling which users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code. Here are the underlying rationales why we are going through these op replacements: 1. `nn.functional.cross_entropy` is the common method that OSS user is using for things like transformer training, to avoid changing user code, we want user to still use this function for loss calculation if they are already using it. 2. `nn.functional.cross_entropy` boils down into `aten.log_softmax` and `aten.nll_loss_foward/backward`, and DTensor now supports those ops already (#117723 #119255 #118917 #119256). They are doing computation with input replicated on the class dimension. 3. However when the input of this loss calculation is sharded on the class dimension, to run sharded computation efficiently, we need to run both `aten.log_softmax` and `aten.nll_loss_foward` with multiple all-reduce collectives in the middle of those aten ops. This is not possible if we are just overriding these two ops, so we need to have some way to decompose these two ops into smaller ops to have collectives run in the middle of these two ops. 4. We explored the existing decompositions (#118950). It seems working, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in a inefficient way, which would trigger an additional expensive collective. Recently some user also reported similar issues https://github.com/pytorch/pytorch/issues/119261. 5. Therefore, currently we are doing our own decomposition inside a context manager for sequence parallelism specifically. Once we have a better decomposition in core, we can possibly take that instead of reinventing the wheels here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877 Approved by: https://github.com/wanchaol	2024-03-02 05:06:26 +00:00
Lucas Pasqualin	9d5dea7812	[DCP] Adds storage reader and planner classes for online loading/sharding of models in torch.save format (#119816 ) as title Differential Revision: [D53718041](https://our.internmc.facebook.com/intern/diff/D53718041/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119816 Approved by: https://github.com/fegin	2024-03-01 00:21:05 +00:00
Kurman Karabukaev	67d3e4f2a2	[TorchElastic] Refactoring to support non-default logging strategy (#120691 ) Summary: Pulling out logging parameters into a logging specs that can be overridden (follow-up changes on possible mechanism) Why? Right now the logging approach is quite rigid: - Requires for log directory to exist and not be empty - Will create tempdir otherwise, - Creates subdir for a run - creates subdir for each attempt - creates files named as stdout.log, stderr.log, error.json In some instances some of the users would like to customize the behavior including file names based on context. And we do have right now a mechanism to template multiplexed teed output prefix. With current changes, users can create custom log spec that can use env variables to change the behavior. Notes: Made `LaunchConf.logs_specs` as an optional field that will be bound to `DefaultLogsSpecs` instance. There are large number of clients (code) that use the API directly without using torchrun API. For those cases, we have to explicitly pass LogSpecs implementation if we would like to override the implementation. For the regular torchrun users, we can use pluggable approach proposed in the follow up change. Test Plan: CI + unit tests Differential Revision: D54176265 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691 Approved by: https://github.com/ezyang	2024-02-29 20:59:17 +00:00
Oleg Khabinov	4b18ab869f	[torch.export] Support is_compiling() flag for non-strict mode (#119602 ) Summary: In non-strict mode of torch.export() we didn't set those `is_compiling()` to `True` which is needed by some models. Test Plan: Unit tests and manual testing. Differential Revision: D53624452 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602 Approved by: https://github.com/suo	2024-02-29 05:52:51 +00:00
Yu, Guangye	12995a5d9d	[2/2] Intel GPU Runtime Upstreaming for Generator (#118613 ) # Motivation According to [[1/2] Intel GPU Runtime Upstreaming for Generator](https://github.com/pytorch/pytorch/pull/118528), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`. # Design Currently, it primarily offers geneartor-related APIs, including - `torch.xpu.default_generators` - `torch.xpu.get_rng_state` - `torch.xpu.get_rng_state_all` - `torch.xpu.initial_seed` - `torch.xpu.manual_seed` - `torch.xpu.manual_seed_all` - `torch.xpu.seed` - `torch.xpu.seed_all` - `torch.xpu.set_rng_state` - `torch.xpu.set_rng_state_all` # Additional Context The differences with CUDA: The generator-related frontend python APIs are 1:1 mapping with CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118613 Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD	2024-02-28 05:28:11 +00:00
Lucas Pasqualin	1c1028ac49	[DCP] Adds utility for converting torch save to dcp (#119815 ) as title Differential Revision: [D53718040](https://our.internmc.facebook.com/intern/diff/D53718040/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119815 Approved by: https://github.com/fegin ghstack dependencies: #119813, #119814	2024-02-22 17:22:11 +00:00
Lucas Pasqualin	1ab441a7dd	[DCP] Adds utility for converting dcp to torch save format (#119814 ) as title Differential Revision: [D53718042](https://our.internmc.facebook.com/intern/diff/D53718042/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119814 Approved by: https://github.com/fegin ghstack dependencies: #119813	2024-02-22 16:55:58 +00:00
Hugues de Saxcé	8464654ae4	Add missing words to torch.utils.checkpoint doc (#120196 ) This PR adds a couple of missing words in the Checkpointing documentation, it doesn't have a specific issue number related to it. Changes are: - "backward." -> "backward propagation." - "to be advanced than" -> "to be more advanced than" Pull Request resolved: https://github.com/pytorch/pytorch/pull/120196 Approved by: https://github.com/soulitzer	2024-02-20 20:18:42 +00:00
andrewor14	6ea4480818	[quant][pt2e] Add `model_is_exported` util function (#119726 ) Summary: This commit adds the `model_is_exported` util function for users to be able to easily tell what APIs to call to move their models between train and eval modes. This has the additional advantage of hiding the implementation of how we detect a model is exported, in case the metadata format changes in the future. Test Plan: python test/test_quantization.py TestQuantizePT2E.test_model_is_exported Differential Revision: [D53812972](https://our.internmc.facebook.com/intern/diff/D53812972) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119726 Approved by: https://github.com/tugsbayasgalan, https://github.com/albanD	2024-02-16 19:29:36 +00:00
soulitzer	312ce35c1f	Rename singleton int to nested int (#119661 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661 Approved by: https://github.com/ezyang	2024-02-16 19:21:17 +00:00
Yu, Guangye	8f9f12c068	Intel GPU Runtime Upstreaming for Device Allocator (#118091 ) # Motivation According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of device `Allocator` dedicated for XPU to PyTorch. And following our design prepare to generalize `Allocator` in parallel. # Design In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below. <p align="center"> <img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218"> </p> # Additional Context We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`. Besides these PRs, we plan to generalize the device `Allocator` device-agnostic through another PR. In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expendable segments and statistics. We will add these features back in the subsequent PR which intend to generalize `Allocator`. The differences with CUDA: only key functionality, and lack of AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segment... Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD ghstack dependencies: #117611, #117619, #117734	2024-02-16 06:46:00 +00:00
Yu, Guangye	4dc75f9084	Intel GPU Runtime Upstreaming for Event (#117734 ) # Motivation As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event` which handles the status of an operation that is being executed. Typically, in some circumstances, we can fine-grain control of the operation execution via `Event`. # Design `XPUEvent` is a movable but not a copyable wrapper around sycl event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, `XPUEvent` can wait for another `XPUEvent` or all the submitted kernels on an `XPUStream` to complete. Align to the other backend, the C++ files related to `Event` will be placed in `aten/src/ATen/xpu` folder. For frontend code, `XPUEvent` runtime API will be bound to Python `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and Python code will be placed in `torch/xpu/streams.py` respectively. # Additional Context It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`. We will be adding support for it soon. Meanwhile `XPUEvent` doesn't support IPC from different processes. For the other parts, we have almost a 1:1 mapping with CUDA. lack of the below APIs: - `torch.cuda.Event.ipc_handle` - `CUDAEvent`'s constructor with `IpcEventHandle` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #117611, #117619	2024-02-16 06:28:26 +00:00
lancerts	444c628e06	Include the scalar tensor auto-transfer in the doc (#119967 ) Fixes #119609 @albanD Co-authored-by: albanD <desmaison.alban@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/119967 Approved by: https://github.com/albanD	2024-02-15 22:37:39 +00:00
drisspg	744898b311	Add doc page for environment variables that effect PyTorch Runtime (#119087 ) # Summary The goal of this PR is to add a doc page to list a number of environment that effect the PyTorch runtime. It will likely not be exhaustive but hopefully will be added and updated to stay relevant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119087 Approved by: https://github.com/janeyx99, https://github.com/eqy	2024-02-15 21:41:38 +00:00
Kazuaki Ishizaki	a2f07bb317	Fix typo under docs directory (#119657 ) This PR fixes typo under `docs` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119657 Approved by: https://github.com/colesbury	2024-02-15 21:14:34 +00:00
Eddie Yan	cd380c794f	[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 ) #113713 Going to clean up some of the checks and will remove draft status after. Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`. CC @drisspg @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663 Approved by: https://github.com/drisspg	2024-02-14 22:02:06 +00:00
andrewor14	8ec8d78ef2	[quant][pt2e][be] Rename eval_utils -> export_utils (#119725 ) It's not really eval_utils anymore, since we added some training related utils. Instead it should be util functions that are related to general export use cases. Differential Revision: [D53711494](https://our.internmc.facebook.com/intern/diff/D53711494) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119725 Approved by: https://github.com/tugsbayasgalan	2024-02-13 19:10:06 +00:00
Yu, Guangye	8fd11cb307	[2/2] Intel GPU Runtime Upstreaming for Stream (#117619 ) # Motivation According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`. # Design Currently, it primarily offers stream-related APIs, including - `torch.xpu.StreamContext` - `torch.xpu.current_stream` - `torch.xpu.set_stream` - `torch.xpu.synchronize` - `torch._C._xpu_getCurrentRawStream` # Additional Context We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR related with `Event`. The differences with CUDA: no default and external stream in XPU and lack of below APIs: - `torch.cuda.ExternalStream` - `torch.cuda.default_stream` - `toch.cuda.is_current_stream_capturing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #117611	2024-02-10 03:39:42 +00:00
Mikayla Gawarecki	3372aa51b4	Integrate swap_tensors into nn.Module.load_state_dict (#117913 ) Added a `torch.Tensor` method that defines how to transform `other`, a value in the state dictionary, to be loaded into `self`, a param/buffer in an `nn.Module` before swapping via `torch.utils.swap_tensors` * `param.module_load(sd[key])` This method can be overridden using `__torch_function__`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117913 Approved by: https://github.com/albanD	2024-02-09 22:32:29 +00:00
Angela Yi	0827510fd3	[export] Remove torch._export.export (#119095 ) XLA changes: https://github.com/pytorch/xla/pull/6486 Test Plan: CI Differential Revision: D53316196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119095 Approved by: https://github.com/ydwu4, https://github.com/zhxchen17, https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri, https://github.com/jerryzh168	2024-02-08 21:22:04 +00:00
Yu, Guangye	9a992b0918	[4/4] Intel GPU Runtime Upstreaming for Device (#116869 ) # Motivation According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR covers the changes under lazy initialization. # Design This PR primarily offers the support of multi-processing via lazy initialization. We lazily initialize our runtime avoiding initializing XPU until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability. # Additional Context We adopt a similar design to CUDA. So we share some code with CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet ghstack dependencies: #119248	2024-02-08 03:01:21 +00:00
Mateus Devino	64aaa8f508	Fix typo on Contribution Guide (#119428 ) Fixes #119427 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119428 Approved by: https://github.com/awgu, https://github.com/kit1980	2024-02-08 01:07:27 +00:00
Mikayla Gawarecki	d5a718d27b	Add swap_tensors path to nn.Module._apply (#117167 ) Added `torch.__future__.{get/set}_swap_module_params_on_conversion` that defaults to `False` for now, but we probably want to modify to override this and default to `True` in `nn.Module._apply` if input is a tensor subclass. From offline discussion, for now we are not allowing `swap_tensor` after the first module forward has been run* if the autograd graph is still alive. The reason being that `torch.utils.swap_tensors(t1, t2)` requires the `use_count` of both `TensorImpl`s associated with `t1` and `t2` to be 1. The first forward pass will install `AccumulateGrad` nodes on each param, which [bump the refcount of the associated TensorImpl](`6cf1fc66e3/torch/csrc/autograd/variable.cpp (L307)`). Future work might be to swap the refs that the `AccumulateGrad` nodes hold if it is necessary. From this, it might seem like we don't need to handle gradients. However, I still handle the grads for the edge case that the grads are set via `p.grad = grad` OR the autograd graph is no longer alive because the output has been garbage collected. If any `swap_tensors` fails on any of the parameters in the `nn.Module` we raise an error. `RNNBase` overrides `nn.Module._apply()` and installs weakrefs on some parameters. As a result, all modules that inherit from `RNNBase` (`RNN`, `GRU` and `LSTM`) cannot use the`swap_tensors` path as of now* Pull Request resolved: https://github.com/pytorch/pytorch/pull/117167 Approved by: https://github.com/albanD ghstack dependencies: #118028	2024-02-07 18:55:44 +00:00
Peter Bell	7c95cc5e03	Add basic reference documentation for symbolic_shapes.py (#118997 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118997 Approved by: https://github.com/albanD	2024-02-07 14:33:42 +00:00
Svetlana Karslioglu	5ae6f6cffe	Test seo torch cuda (#119324 ) Testing if this will help improve SEO of this page. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119324 Approved by: https://github.com/albanD	2024-02-07 00:39:51 +00:00
Edward Z. Yang	3f0fd36835	Introduce size oblivious guards (#118579 ) Fixes https://github.com/pytorch/pytorch/issues/117361 The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one. This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds. The infra pieces of this PR are: * Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv * When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`. * Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way. The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises. As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.) When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard on unbacked SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them where ever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579 Approved by: https://github.com/eellison, https://github.com/lezcano	2024-02-06 19:45:32 +00:00
Edward Z. Yang	6620176da7	Add documentation for meta device (#119119 ) Fixes https://github.com/pytorch/pytorch/issues/119098 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/119119 Approved by: https://github.com/bdhirsh	2024-02-04 01:05:22 +00:00
Mikayla Gawarecki	9ffed22391	Document file format returned by torch.save (#118719 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118719 Approved by: https://github.com/albanD	2024-02-03 02:11:44 +00:00
Yu, Guangye	a205e7bf56	[3/4] Intel GPU Runtime Upstreaming for Device (#116850 ) # Motivation According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`. # Design This PR primarily offers device-related APIs in python frontend, including - `torch.xpu.is_available` - `torch.xpu.device_count` - `torch.xpu.current_device` - `torch.xpu.set_device` - `torch.xpu.device` - `torch.xpu.device_of` - `torch.xpu.get_device_name` - `torch.xpu.get_device_capability` - `torch.xpu.get_device_properties` - ==================== - `torch.xpu._DeviceGuard` - `torch.xpu._is_compiled` - `torch.xpu._get_device` # Additional Context We will implement the support of lazy initialization in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet	2024-02-01 12:31:26 +00:00
Michael Lazos	e426924c19	Change classification to beta for TORCH_LOGS (#118682 ) Changes classification of TORCH_LOGS to beta Pull Request resolved: https://github.com/pytorch/pytorch/pull/118682 Approved by: https://github.com/svekars	2024-01-31 21:50:55 +00:00
CaoE	bacbad5bc9	add GradScaler on CPU (#109993 ) Step 2 of https://github.com/pytorch/pytorch/issues/111559. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109993 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-01-29 23:42:35 +00:00
albanD	a40be5f4dc	Autograd doc cleanup (#118500 ) I don't think we'll realistically go though deprecation for these now since there are a couple use of each online. So document appropriately. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118500 Approved by: https://github.com/soulitzer	2024-01-29 21:51:33 +00:00
Will Constable	abe3c55a6a	Update DDP dynamo debug docs (#118295 ) Refreshes https://github.com/pytorch/pytorch/pull/114201 and updates it to include other log names that also include ddp_optimizer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118295 Approved by: https://github.com/LucasLLC, https://github.com/wanchaol	2024-01-29 14:58:26 +00:00
Tobias Ringwald	62c1e4a578	Added missing CircularPad*d references so the docs are actually built. (#118465 ) Fixes #118429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118465 Approved by: https://github.com/Skylion007	2024-01-27 22:39:01 +00:00
Lucas Pasqualin	ff8e33556e	Enables load balancing duplicates in DCP (#116469 ) Enables the deduplication of saved entries by load balancing duplicates across ranks. Tested with existing and modified tests. Additionally tested with the following code snippet, which saves a 20GB DDP model in ~3 seconds on 8 ranks. Before this PR, the same operation has been measured at ~19 seconds. ``` def run(local_rank, world_size, param_size, num_params, work_dir): os.environ["RANK"] = str(local_rank) os.environ["MASTER_ADDR"] = "localhost" os.environ["MASTER_PORT"] = "12355" device = torch.device(f"cuda:{local_rank}") torch.cuda.set_device(device) dist.init_process_group(backend="nccl", rank=local_rank, world_size=world_size) model = Model(param_size=param_size, num_params=num_params) model = DistributedDataParallel(model, gradient_as_bucket_view=True) _patch_model_state_dict(model) sz = sum(t.nelement() * t.element_size() for t in model.parameters()) rank_0_print(f"Model size: {sz / 1_000_000_000.0} GB") rank_0_print("Saving the model with DCP...") checkpointer = _FileSystemCheckpointer( f"{args.work_dir}/dcp", sync_files=False, single_file_per_rank=False, thread_count=1 ) begin_ts = time.monotonic() checkpointer.save(state_dict={"model": model}) end_ts = time.monotonic() rank_0_print(f"Took {end_ts - begin_ts} seconds with DCP") ``` Differential Revision: [D52435926](https://our.internmc.facebook.com/intern/diff/D52435926/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116469 Approved by: https://github.com/fegin, https://github.com/wz337	2024-01-26 22:34:14 +00:00
Sherlock Huang	6596a3f23d	[Export] Remove ScriptObjectMeta (#118241 ) Summary: As title. Use CustomObjArgument as ScriptObjectMeta Test Plan: CIs Reviewed By: zhxchen17 Differential Revision: D53062230 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118241 Approved by: https://github.com/zhxchen17	2024-01-26 00:37:19 +00:00
drisspg	4e29f01bf2	Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689 ) # Summary Simplification of Backend Selection This PR deprecates the `torch.backends/cuda/sdp_kernel` context manager and replaces it with a new context manager `torch.nn.attention.sdpa_kernel`. This context manager also changes the api for this context manager. For `sdp_kernel` one would specify the backend choice by taking the negation of what kernel they would like to run. The purpose of this backend manager was to only to be a debugging tool, "turn off the math backend" and see if you can run one of the fused implementations. Problems: - This pattern makes sense if majority of users don't care to know anything about the backends that can be run. However, if users are seeking to use this context manager then they are explicitly trying to run a specific backend. - This is not scalable. We are working on adding the cudnn backend and this API makes it so so that more implementations will need to be turned off if user wants to explicitly run a given backend. - Discoverability of the current context manager. It is somewhat un-intutive that this backend manager is in backends/cuda/init when this now also controls the CPU fused kernel behavior. I think centralizing to attention namespace will be helpful. Other concerns: - Typically backends (kernels) for operators are entirely hidden from users and implementation details of the framework. We have exposed this to users already, albeit not by default and with beta warnings. Does making backends choices even more explicit lead to problems when we potentially want to remove existing backends, (perhaps inputs shapes will get covered by newer backends). A nice side effect is now that we aren't using the `BACKEND_MAP` in test_transformers many, many dynamo failures are passing for CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114689 Approved by: https://github.com/cpuhrsch	2024-01-24 22:28:04 +00:00
Zhengxu Chen	abd759d50d	[fx] Add hooks to intercept node replacements. (#117825 ) Summary: Adding an experimental API to FX graph module to place "hooks" every time when we are changing or replacing nodes in a graph, so that we can properly update the new name in graph signature and potentially other places. Test Plan: buck test mode/opt -c fbcode.enable_gpu_sections=true caffe2/test/distributed/_tensor/experimental:tp_transform buck test mode/opt caffe2/test:test_export -- -r test_replace_hook Differential Revision: D52896531 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117825 Approved by: https://github.com/avikchaudhuri	2024-01-23 22:28:40 +00:00
Matteo Migliarini	fdac55c35d	Added example regarding weight_decay distinction with per-parameter API (#117436 ) Added new example and description regarding per-parameter `weight_decay` distinction for bias and non-bias terms. Fixes #115935 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117436 Approved by: https://github.com/janeyx99	2024-01-22 21:26:02 +00:00
Wanchao Liang	2bb2cc0b71	[tp] add clarification to doc and improve TP examples (#117618 ) This PR adds a clarification about evenly sharded assumption in the main tp doc and improved the tp examples by adding device mesh constructions fixes https://github.com/pytorch/pytorch/issues/100044 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117618 Approved by: https://github.com/wconstab, https://github.com/awgu	2024-01-22 18:56:50 +00:00
Stas Bekman	86b4b27e26	[docs] start a new FSDP notes doc (#117323 ) As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion. I hope I did the RST right, I haven't done RST in a while. - The first section is Andrew's words verbatim + formatting - The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better. tagging @albanD as requested. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323 Approved by: https://github.com/awgu	2024-01-22 15:46:35 +00:00
PyTorch MergeBot	02209b5880	Revert "[docs] start a new FSDP notes doc (#117323 )" This reverts commit `7f474da6bc`. Reverted https://github.com/pytorch/pytorch/pull/117323 on behalf of https://github.com/awgu due to broke docs ([comment](https://github.com/pytorch/pytorch/pull/117323#issuecomment-1902740900))	2024-01-21 19:47:27 +00:00
Stas Bekman	7f474da6bc	[docs] start a new FSDP notes doc (#117323 ) As discussed on [slack](https://pytorch.slack.com/archives/C3PDTEV8E/p1703699711772289) adding Andrew Gu's advanced FSDP design notes with a few additions from myself based on our discussion. I hope I did the RST right, I haven't done RST in a while. - The first section is Andrew's words verbatim + formatting - The second section is Andrew's words verbatim + formatting + a few of my additions that were confirmed by Andrew, and which hopefully should help understand the process better. tagging @albanD as requested. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117323 Approved by: https://github.com/albanD, https://github.com/awgu	2024-01-21 15:11:24 +00:00
suo	4057d005ff	Initial torchbind support in PT2 (#117697 ) This PR adds the bare minimum functionality to get torchbind working in an e2e testable way on PT2. It implements: * ProxyTensor support * Simple torch.export support (proxytensor-only path, e.g. non-strict). * add some tests exercising the path. Because all this is not fully baked, I hide the functionality behind a feature flag (`enable_torchbind_tracing()`) so it does not affect regular users for now. Still on the agenda: * Dynamo support * Actual FakeMode support * Mutability support Hoping to get this first bit in as a standalone, as it will unblock some more extensive experimentation/testing going on internally. Differential Revision: [D51825372](https://our.internmc.facebook.com/intern/diff/D51825372/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117697 Approved by: https://github.com/SherlockNoMad	2024-01-19 06:28:20 +00:00
PyTorch MergeBot	2f84a9d37c	Revert "[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 )" This reverts commit `5aa92b5090`. Reverted https://github.com/pytorch/pytorch/pull/115663 on behalf of https://github.com/PaliC due to Unfortunately, this pr breaks cuda builds internally ([comment](https://github.com/pytorch/pytorch/pull/115663#issuecomment-1899388813))	2024-01-18 23:40:30 +00:00
Angela Yi	92d718aed1	[export] Add lifted constant obj to input (#116985 ) Test Plan: wip Differential Revision: D52556070 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116985 Approved by: https://github.com/suo	2024-01-18 22:10:53 +00:00
suo	ccc8440609	[export] introduce WrapperModule (#117571 ) Simple module to wrap a callable. This is a useful utility for when we start requiring that torch.export take an nn.Module. Differential Revision: [D52791310](https://our.internmc.facebook.com/intern/diff/D52791310/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117571 Approved by: https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri ghstack dependencies: #117570	2024-01-18 03:40:34 +00:00
Eddie Yan	5aa92b5090	[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 ) #113713 Going to clean up some of the checks and will remove draft status after. Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`. CC @drisspg @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663 Approved by: https://github.com/drisspg	2024-01-18 01:20:36 +00:00
Kurman Karabukaev	a60b566d37	[TorchElastic] Support for overprovisioning in C10 based rendezvous (#117066 ) Summary: Allow TorchElastic to manage more nodes than a maximum nnodes specifed in a job. It will be used as a spare capacity/warm nodes for schedulers that support elasticity. RFC: https://github.com/pytorch/pytorch/issues/114097 Test Plan: Integration tests Differential Revision: D52343874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117066 Approved by: https://github.com/zdevito	2024-01-18 01:16:55 +00:00
Peter Bell	001585f446	[fx][inductor] Add statically_known_true utility for SymBool (#117359 ) This adds a function `statically_known_true` for `SymBool` that works like inductor's `is_expr_static_and_true`. That is, it tries to simplify the expression to a constant or returns `False` if it cannot be simplified. This is useful in cases that can be optimized if the condition is met, otherwise it doesn't effect correctness so we can avoid adding guards. I also use this new function in inductor for `FakeTensorUpdater` and `remove_noop_pass` which both generated unexpected guards previously. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117359 Approved by: https://github.com/lezcano	2024-01-15 18:01:10 +00:00
Sai-Pra	19502ff6aa	Fixed typo in build_activation_images.py (#117458 ) In line 24 of build_activation_images.py, I changed "programmaticly" to "programmatically" to be dramatically correct. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117458 Approved by: https://github.com/malfet	2024-01-15 03:27:40 +00:00
vasiliy	a6d33614d6	add float8 types to dtypes table (#117375 ) Summary: As titled Test Plan: CI Reviewers: Subscribers: Tasks: Tags: Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/117375 Approved by: https://github.com/ezyang	2024-01-15 00:23:07 +00:00
Edward Z. Yang	d006cae2a8	Update documentation for unsigned int types (#116804 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116804 Approved by: https://github.com/albanD ghstack dependencies: #116595, #116803	2024-01-08 22:02:10 +00:00
Guo Yejun	5323b2daa5	[docs] add mode="reduce-overhead" into torch.compile to enable cuda g… (#116529 ) …raph Pull Request resolved: https://github.com/pytorch/pytorch/pull/116529 Approved by: https://github.com/eellison	2024-01-05 22:54:20 +00:00
Angela Yi	6413511713	[export][refactor][4/n] Make equality_constraints optional (#116233 ) Summary: needed to remove equality_contraints eventually :P Test Plan: CI Differential Revision: D52351709 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116233 Approved by: https://github.com/tugsbayasgalan	2024-01-05 00:50:52 +00:00
Mikayla Gawarecki	0f6f582c0d	Add config to disable TransformerEncoder/MHA fastpath (#112212 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112212 Approved by: https://github.com/jbschlosser	2024-01-02 23:59:30 +00:00
lezcano	b18d8d4595	Add a wrapper to transform a NumPy function into a PyTorch function (#114610 ) A less general version of this wrapper was used in the keynote on `torch.compile(numpy)`. We expose a generic version of the wrapper that works seamlessly with `torch.compile`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114610 Approved by: https://github.com/albanD	2024-01-02 18:35:29 +00:00
Anupam Bhatnagar	4371939751	Removing HTA documentation (#116513 ) Removing HTA documentation Pull Request resolved: https://github.com/pytorch/pytorch/pull/116513 Approved by: https://github.com/aaronenyeshi, https://github.com/malfet, https://github.com/atalman	2023-12-28 23:04:23 +00:00
angelayi	6b91e6907e	Add setUserEnabledNNPACK config (#116152 ) When exporting a model with a convolution kernel on cpu, if mkldnn is disabled and nnpack is enabled, export will go down the nnpack optimized convolution kernel for certain shapes ((code pointer)[`cd449e260c/aten/src/ATen/native/Convolution.cpp (L542-L552)`]). This means that we will automatically create a guard on that certain shape. If users want to export without any restrictions, one option is to disable nnpack. However, no config function exists for this, so this PR is adding a config function, similar to the `set_mkldnn_enabled` function. Original context is in https://fb.workplace.com/groups/1075192433118967/posts/1349589822345892/?comment_id=1349597102345164&reply_comment_id=1349677642337110. To test the flag, the following script runs successfully: ``` import os import torch from torchvision.models import ResNet18_Weights, resnet18 torch.set_float32_matmul_precision("high") model = resnet18(weights=ResNet18_Weights.DEFAULT) model.eval() with torch.no_grad(): # device = "cuda" if torch.cuda.is_available() else "cpu" torch.backends.mkldnn.set_flags(False) torch.backends.nnpack.set_flags(False) # <--- Added config device = "cpu" model = model.to(device=device) example_inputs = (torch.randn(2, 3, 224, 224, device=device),) batch_dim = torch.export.Dim("batch", min=2, max=32) so_path = torch._export.aot_compile( model, example_inputs, # Specify the first dimension of the input x as dynamic dynamic_shapes={"x": {0: batch_dim}}, # Specify the generated shared library path options={ "aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"), "max_autotune": True, }, ) ``` I'm not sure who to add as reviewer, so please feel free to add whoever is relevant! Pull Request resolved: https://github.com/pytorch/pytorch/pull/116152 Approved by: https://github.com/malfet	2023-12-27 06:00:16 +00:00
Lucas Pasqualin	b342286646	adds async save, makes checkpointer private (#116293 ) Adds Async Save and also makes `Checkpointer` classes private. The original PR was here: https://github.com/pytorch/pytorch/pull/115864 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116293 Approved by: https://github.com/fegin	2023-12-22 05:22:39 +00:00
suo	bc3ef1684e	[export] refactor unflatten.py to be a top-level API (#115466 ) This is in preparation for the merging of the internal and external versions of the unflattener. Unflatten needs to be its own API because we are adding more options to it in forthcoming diffs. Differential Revision: [D52001133](https://our.internmc.facebook.com/intern/diff/D52001133/) @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/115466 Approved by: https://github.com/zhxchen17	2023-12-21 20:52:29 +00:00
Damien	2d2016fdf8	WIP Add compatibility with channels_last_3d for conv3d (#114790 ) Part of a multi-PR work to fix #59168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114790 Approved by: https://github.com/albanD	2023-12-20 19:28:25 +00:00
Bin Bao	fabf9433e7	[AOTI][refactor] Organize model runner files (#116022 ) Summary: Move runner util files into a subdirectory and put AOTIModelContainerRunnerCpu into a separate file Differential Revision: [D52300693](https://our.internmc.facebook.com/intern/diff/D52300693) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116022 Approved by: https://github.com/khabinov	2023-12-20 15:35:34 +00:00
FFFrog	327bdcdb14	Some tiny modification about torch.set/get_default_device (#116014 ) 1. fix bug of torch.set_default_device in multi-threading 2. add new interface named torch.get_default_device Fixes #115333 Fixes #115917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116014 Approved by: https://github.com/malfet, https://github.com/jansel	2023-12-19 05:08:06 +00:00
Wanchao Liang	61abacf829	[tp] improve documentation (#115880 ) Improve the TP documentation in terms of format and descriptions Pull Request resolved: https://github.com/pytorch/pytorch/pull/115880 Approved by: https://github.com/XilunWu	2023-12-15 18:44:22 +00:00
Will Constable	28e4004286	Add doc for torch.distributed.breakpoint (#115656 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115656 Approved by: https://github.com/wanchaol, https://github.com/fegin ghstack dependencies: #115705	2023-12-14 14:45:36 +00:00
angelayi	dd9a989b83	[export][reland][refactor][1/n] Split dynamic shapes (#115556 ) Reland of https://github.com/pytorch/pytorch/pull/114764 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115556 Approved by: https://github.com/zhxchen17	2023-12-12 05:36:41 +00:00
atalman	b88be1686d	Revert "[export][refactor][1/n] Move dynamic shapes logic (#114764 )" (#115508 ) GitHub first oncall. This reverts commit `53bf8cfcf9`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115508 Approved by: https://github.com/malfet, https://github.com/angelayi	2023-12-11 14:54:51 +00:00
William Wen	f614ed78b8	[docs, dynamo] fix typos in dynamo custom backend docs (#115444 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115444 Approved by: https://github.com/eellison	2023-12-08 23:58:26 +00:00
albanD	a2b89154bf	New swap function (#111747 ) This PR is proposing a new approach to solve the nn/optim only linked by python object identity problem. The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references. This would allow us to swap the `model.weight` with a new Tensor (can be any subclass of Tensor and any TensorImpl (xla, sparse, nested tensorimpl would work)). The use within nn will be done in a follow up. This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs). Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots. The main limitation of this approach is that the number of slots need to match for the objects being swapped and thus limit usage of slots in subclasses. Draft right now to see what @colesbury thinks about doing this? Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747 Approved by: https://github.com/colesbury	2023-12-08 18:49:35 +00:00
Linus	5f2ff29569	Fix typo in `https://pytorch.org/docs/stable/sparse.html` (#115282 ) Fixes #111473 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115282 Approved by: https://github.com/svekars	2023-12-08 18:31:33 +00:00
Wongboo	68f74dd162	Add python and C++ support for LPPool3d (#114199 ) Add python and C++ support for LPPool3d to Fixes #114114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114199 Approved by: https://github.com/mikaylagawarecki	2023-12-08 18:18:44 +00:00
Iris Zhang (PyTorch)	23fa9621e4	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 ) (#115193 ) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, add documentation. We created stubs for public class and methods in torch.distributed.device_mesh so that torch.distributed.device_mesh can be imported with or without distributed is available(). Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/115099 Prior to landing, CI signals are all passed. Shipit added the "ci/trunk" label to the PR and DID NOT wait for it and went ahead committing. More context can be found in the reverted PR above. Test Plan: CI. Differential Revision: D51861018 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115193 Approved by: https://github.com/fegin	2023-12-08 08:44:32 +00:00
Lucas Pasqualin	5432088098	Adds Checkpointer Wrapper for DCP [3/N] (#114603 ) Adds a useful high level wrapper for calling `dist.save/load` with the correct storage readers and writers. Instead of doing: ``` DCP.save( state_dict={...}, storage_writer=StorageWriter(...) ) DCP.load( state_dict={...}, storage_reader=StorageReader(...) ) ``` We can now do: ``` checkpointer = Checkpointer(...) checkpointer.save(state_dict={...}) checkpointer.load(state_dict={...}) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114603 Approved by: https://github.com/fegin, https://github.com/wz337	2023-12-08 01:03:21 +00:00
Howard Huang	3e66385ddd	Add Work to distributed docs (#115172 ) Summary: Documenting the `Work` object For a collective (broadcast, all_reduce, etc.) when async_op=True we return a `Work` object to which users can call `.wait()`, `.is_success()`, among other things but this class is not documented Test Plan: Preview the docs build in OSS Differential Revision: D51854974 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115172 Approved by: https://github.com/wconstab	2023-12-07 18:12:10 +00:00
angelayi	53bf8cfcf9	[export][refactor][1/n] Move dynamic shapes logic (#114764 ) 1/n of refactoring export code: * Moved dynamic shapes/constraints/dynamic_dims logic in torch/_export/__init__.py and torch/export/__init__.py to torch/export/dynamic_shapes.py Differential Revision: [D51823962](https://our.internmc.facebook.com/intern/diff/D51823962) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114764 Approved by: https://github.com/ydwu4	2023-12-06 16:46:38 +00:00
drisspg	d4c79a3078	Add an attention bias subclass for a lower right causal masking (#114823 ) # Summary This PR introduces a new Tensor subclass that is designed to be used with torch.nn.functional.scaled_dot_product_attention. Currently we have a boolean `is_causal` flag that allows users to do do causal masking without the need to actually create the "realized" attention bias and pass into sdpa. We originally added this flag since there is native support in both fused kernels we support. This provides a big performance gain ( the kernels only need to iterate over ~0.5x the sequence, and for very large sequence lengths this can provide vary large memory improvements. The flag was introduced when the early on in the kernel development and at the time it was implicitly meant to "upper_left" causal attention. This distinction only matters when the attention_bias is not square. For a more detailed break down see: https://github.com/pytorch/pytorch/issues/108108. The kernels default behavior has since changed, largely due to the rise of autogressive text generation. And unfortunately this would lead to a BC break. In the long term it may actually be beneficial to change the default meaning of `is_causal` to represent lower_right causal masking. The larger theme though is laid here: https://github.com/pytorch/pytorch/issues/110681. The thesis being that there is alot of innovation in SDPA revolving around the attention_bias being used. This is the first in hopefully a few more attention_biases that we would like to add. The next interesting one would be `sliding_window` which is used by the popular mistral model family. Results from benchmarking, I improved the meff_attention perf hence the slightly decreased max perf. ```Shell +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ \| Type \| Speedup \| batch_size \| num_heads \| q_seq_len \| k_seq_len \| embed_dim \| dtype \| head_dim \| +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ \| Average \| 1.2388050062214226 \| \| \| \| \| \| \| \| \| Max \| 1.831672915579016 \| 128 \| 32 \| 1024 \| 2048 \| 2048 \| torch.bfloat16 \| 64 \| \| Min \| 0.9430534166730135 \| 1 \| 16 \| 256 \| 416 \| 2048 \| torch.bfloat16 \| 128 \| +---------+--------------------+------------+-----------+-----------+-----------+-----------+----------------+----------+ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114823 Approved by: https://github.com/cpuhrsch	2023-12-06 08:29:26 +00:00
Joel Schlosser	22704426c3	Expand dynamic dims support for traceable subclasses (#114311 ) Continuation of #112185, following the design in this [doc](https://docs.google.com/document/d/1ipSxcTzEMMOAPvxP-YJlD5JBZZmIGgh8Q34ixtOUCRo). Summary: * Introduce `SubclassSymbolicPolicy` containing separate dynamic dim / constraint policies for the outer and inner tensors * Expand the automatic dynamic algorithm to recurse into inner tensors and produce one of these for a subclass instance * Maintain legacy behavior for subclasses by recursively calling `mark_dynamic()` on inner tensors of the same dim as outer when `mark_dynamic(outer, ...)` is called * Addresses this: `6a86cf00ad/torch/_dynamo/variables/builder.py (L1750)` * Add `outer_size` and `outer_stride` arguments to `__tensor_unflatten__()` so that you can find out what symbols were allocated for the outer size / stride (you are expected to return a tensor that compares equal to the outer symbols) * Signatures now: ```python # attrs is a list of inner tensor attributes on x; inner_tensor = getattr(x, attr) # ctx is anything useful for rebuilding the class we want to guard on attrs, ctx = x.__tensor_flatten__() ... # inner_tensors is a dict of {attr -> tensor} # ctx is taken unmodified from flattening and (eventually) guarded on # outer_size is the expected size of the output; possibly symbolic # outer_stride is the expected strides of the output; possibly symbolic y = MySubclass.__tensor_unflatten__(inner_tensors, ctx, outer_size, outer_stride) # at the __tensor_unflatten__() call-site in PT2, we assert y.shape == outer_size and y.stride() == outer_stride # the assert simplifies symbols when there are relationships between outer and inner symbols ``` * Size info needed for `NestedTensor` at least, stride info needed for `DTensor` at least * Punting on `outer_storage_offset` because storage_offset handling is horribly broken in PT2 right now * ~~Add new `__tensor_mark_dynamic__()` to allow overriding the behavior of mark_dynamic on a per-subclass basis~~ (booted to future work) * ~~Add guards for tensor subclasses by calling `__tensor_flatten__()` in the guard to test equality on `ctx`~~ * Now handled in #114469 * Next PR: add TENSOR_MATCH guards on inner tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/114311 Approved by: https://github.com/ezyang, https://github.com/drisspg, https://github.com/voznesenskym, https://github.com/bdhirsh	2023-12-05 21:09:25 +00:00
angelayi	5fdae89c03	[docs][aoti] Link to export docs in AOTI docs (#115088 ) Context: https://fb.workplace.com/groups/1075192433118967/posts/1341833143121560/?comment_id=1341841786454029 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115088 Approved by: https://github.com/desertfire	2023-12-05 20:22:42 +00:00
Anupam Bhatnagar	85d4708512	HTA docs (#115060 ) Added documentation for Holistic Trace Analysis Pull Request resolved: https://github.com/pytorch/pytorch/pull/115060 Approved by: https://github.com/aaronenyeshi	2023-12-05 19:38:09 +00:00
Nikita Shulga	a827ac71f2	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 )" This reverts commit `eaa64339d6`.	2023-12-05 08:59:36 -08:00
Iris Zhang (PyTorch)	eaa64339d6	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#115099 ) Summary: Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation. Original diff reverted: D51629761 Original PR reverted: https://github.com/pytorch/pytorch/pull/114991 It was failing because failing a public module binding tests in MacOS, and this is due to the change in import order for torch/distributed/fsdp/_common_utils.py. Since this original import would still work, we remove the changes in this file. Test Plan: CI. Differential Revision: D51825114 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115099 Approved by: https://github.com/wanchaol, https://github.com/fegin	2023-12-05 05:44:52 +00:00
soulitzer	a7bcc78bff	Make it clearer that current selective AC is PT2-only and private (#115081 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115081 Approved by: https://github.com/albanD	2023-12-04 23:01:22 +00:00
Xuehai Pan	55064a4ef9	[BE] add parentheses to kwargs unpacking `func(args, (kwargs or {}))` (#115026 ) This PR adds parentheses to kwargs unpacking `func(args, *(kwargs or {}))` for better code readability. With/without the parentheses are semantic equivalent because they produce the same bytecode. ```console $ echo "func(args, *kwargs or {})" \| python3 -m dis - 0 0 RESUME 0 1 2 PUSH_NULL 4 LOAD_NAME 0 (func) 6 LOAD_NAME 1 (args) 8 BUILD_MAP 0 10 LOAD_NAME 2 (kwargs) 12 JUMP_IF_TRUE_OR_POP 1 (to 16) 14 BUILD_MAP 0 >> 16 DICT_MERGE 1 18 CALL_FUNCTION_EX 1 20 POP_TOP 22 LOAD_CONST 0 (None) 24 RETURN_VALUE $ echo "func(args, **(kwargs or {}))" \| python3 -m dis - 0 0 RESUME 0 1 2 PUSH_NULL 4 LOAD_NAME 0 (func) 6 LOAD_NAME 1 (args) 8 BUILD_MAP 0 10 LOAD_NAME 2 (kwargs) 12 JUMP_IF_TRUE_OR_POP 1 (to 16) 14 BUILD_MAP 0 >> 16 DICT_MERGE 1 18 CALL_FUNCTION_EX 1 20 POP_TOP 22 LOAD_CONST 0 (None) 24 RETURN_VALUE ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/115026 Approved by: https://github.com/Skylion007	2023-12-03 20:03:26 +00:00
PyTorch MergeBot	3a2e2044cd	Revert "[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 )" This reverts commit `729ac7317a`. Reverted https://github.com/pytorch/pytorch/pull/114991 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/114991#issuecomment-1837214567))	2023-12-02 17:55:51 +00:00
Wanchao Liang	28925902fa	[TP] fully rewrite Tensor Parallel APIs (#114732 ) This PR rewrites Tensor Parallel implementation. Tensor Parallel APIs supposed to be a very thin-wrapper to DTensor APIs, but the current implementation got too messy and buggy. It's really hard to debug what went wrong when using it. It's crucially important for advanced users or developers to understand the API and its implementation easily without going through all different types of functions and utils, so that they could trust what happen under the hood. In particular this PR: * Make ParallelStyle to be a real contract API for parallelize_module to take, each concrete ParallelStyle only needs to implement `apply` to apply the sharding to nn.Module, remove all non-necessary fields. This also enable easier ParallelStyle authoring going forward. * Keep the ColwiseParallel and RowwiseParallel public interface, but refactor them in a way that makes the parameter sharding, inputs and outputs handling lives within the style itself, so that it's easy to understand how Linear/Embedding layers are sharded and how the inputs/outputs transformations are performed. * remove all those private _prepare_input/_prepare_output_fn fields for both ColwiseParallel/RowwiseParallel. Since we throw deprecation messages in nightly for a while and TP is on prototype release, the fields are also private, it should be safe to remove them * Refactor the recently landed PrepareModuleInput/Output style, change output_layouts to desired_input/output_layouts, group the function inside the style itself, no default arguments for these two styles and user need to specify them to think about the sharding layouts. Fixed bugs about not handling `use_local_output` flag. * Make default arguments be None instead of Placement object, this is standard python practice to not have custom object instance as default argument * Remove all dead APIs (i.e. PairwiseParallel and SequenceParallel style, all prepare input/output functions) as we throw deprecation msgs for a while, and in the progress of removing all of them from the tests. * throw deprecation warning for `tp_mesh_dim` as we recomemnd use device mesh slice/indexing instead of manually specify mesh dim * Rewrite all documentations for every ParallelStyle and make the documentation more clear about what each style is doing TODOs: * Rewrite TP tests to adjust for the changes we have in this PR * add more tests to guard the bug fixes Differential Revision: [D51761183](https://our.internmc.facebook.com/intern/diff/D51761183) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114732 Approved by: https://github.com/wz337, https://github.com/fduwjj	2023-12-02 08:18:12 +00:00
Iris Zhang (PyTorch)	729ac7317a	[DeviceMesh] Rename _device_mesh.py to device_mesh.py to prepare for beta (#114710 ) (#114991 ) Summary: Same content of changes as https://github.com/pytorch/pytorch/pull/114710 Rename _device_mesh.py to device_mesh.py, update all callsites, adds documentation. ghstack-source-id: 208980207 exported-using-ghexport Test Plan: CI. Reviewed By: wanchaol Differential Revision: D51629761 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114991 Approved by: https://github.com/wanchaol, https://github.com/fduwjj, https://github.com/fegin	2023-12-02 04:39:41 +00:00
Rohan Varma	3c78ea4c9d	[DDP][Compile] Test to Ensure torch.compile works w/static_graph=True (#114621 ) Resolves https://github.com/pytorch/pytorch/issues/93672. This was actually fixed by https://github.com/pytorch/pytorch/pull/103487 but I didn't realize that PR also fixes torch compile at the time. Differential Revision: [D51596148](https://our.internmc.facebook.com/intern/diff/D51596148/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/114621 Approved by: https://github.com/wconstab	2023-12-01 22:18:45 +00:00
Lucas Pasqualin	f073dcd4f7	Stateful Checkpointing for Distributed [1/N] (#113867 ) First pass at adding a save/load API, as well as definition of Stateful objects. Amongst a couple todo's, we still need to explore adding an `all_gather` & potentially a `barrier` while iterating through state keys. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113867 Approved by: https://github.com/fegin, https://github.com/wz337	2023-12-01 19:21:03 +00:00
Philip Meier	373f2060ba	fix extending torch native API docs (#114863 ) Couldn't think of a better `release notes:` label. Feel free to set a more fitting one Pull Request resolved: https://github.com/pytorch/pytorch/pull/114863 Approved by: https://github.com/mikaylagawarecki	2023-12-01 06:09:35 +00:00
Jerry Zhang	64fd706b21	[quant][pt2e] Add generate_numeric_debug_handle pass (#114315 ) Summary: This is a util for numeric suite in pt2 export so that we can build a more streamlined UX for numerical debugging in quant + executorch stack Test Plan: python test/test_quantization.py TestGenerateNumericDebugHandle Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/114315 Approved by: https://github.com/zhxchen17	2023-12-01 03:38:17 +00:00
William Wen	38ae17d166	[dynamo, docs] update dynamo backend registration docs (#114820 ) Update docs to reflect current backend registration API. Add `lookup_backend` to root `dynamo` module. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114820 Approved by: https://github.com/eellison	2023-11-30 21:41:05 +00:00
Nikita Shulga	a9d5133207	[ez][doc] Fix sample code in onnx_dynamo.rst (#114770 ) By adding `import torch.nn as nn` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114770 Approved by: https://github.com/atalman, https://github.com/thiagocrepaldi	2023-11-29 19:27:52 +00:00
Guo Yejun	4aa2c51a09	[doc] fix typo on graph 3 that is recorded (#114666 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/114666 Approved by: https://github.com/eellison	2023-11-28 20:40:13 +00:00
Guo Yejun	4a35ec3c0e	[docs] correct the code for cudagraph trees integration (#114583 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/114583 Approved by: https://github.com/eellison	2023-11-28 20:28:52 +00:00
lezcano	4ba3e6758d	Canonicalize runtime asserts (#114509 ) This allows us to remove quite a few redundant runtime asserts, and potentially a number of guards as well. On ``` python test/dynamo/test_subclasses.py -k test_unbind ``` we go from ``` inserting runtime assert i0 <= s0 inserting runtime assert 0 <= -i0 + s0 inserting runtime assert i0 + i1 <= s0 inserting runtime assert i0 <= -i1 + s0 inserting runtime assert i0 + i1 + i2 <= s0 inserting runtime assert i0 + i1 <= -i2 + s0 inserting runtime assert Eq(i0 + i1 + i2 + i3, s0) inserting runtime assert i0 + i1 + i2 + i3 <= s0 inserting runtime assert i0 + i1 + i2 <= -i3 + s0 ``` to ``` inserting runtime assert i0 - s0 <= 0 inserting runtime assert i0 + i1 - s0 <= 0 inserting runtime assert i0 + i1 + i2 - s0 <= 0 inserting runtime assert Eq(i0 + i1 + i2 + i3, s0) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/114509 Approved by: https://github.com/voznesenskym	2023-11-28 01:38:47 +00:00
voznesenskym	081c5b3adc	Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926 ) (#114526 ) Summary: The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors at the end of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor. This PR is the result of a lot of back and forth with ezyang and eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same: 1) We cache source->symbol in shape_env 2) We pass policy objects around, stored at dynamo fakificaiton time, and reused for later fakification 3) We create a new fake mode for backends (from https://github.com/pytorch/pytorch/pull/113605/files) This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning back potentially different tensors than requested, and if that was an anti pattern (it is) we want to hack in with the symbol cache (we don't). We went back to the drawing board here, but with a few concessions: 1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons 2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (ezyang did this) cc penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 aakhundov kadeng imported-using-ghimport Test Plan: Imported from OSS Reviewed By: huydhn, Chillee Differential Revision: D51566250 Pulled By: voznesenskym Pull Request resolved: https://github.com/pytorch/pytorch/pull/114526 Approved by: https://github.com/Chillee, https://github.com/huydhn	2023-11-26 23:40:32 +00:00
Akihiro Nitta	d37c4c6995	Update `torch.compiler_troubleshooting.rst` (#114530 ) If you copy and paste the env var in the docs: ```console TORCHDYNAMO_REPRO_AFTER=“aot” ``` it leads to this error: ```python @functools.wraps(unconfigured_compiler_fn) def debug_wrapper(gm, example_inputs, kwargs): compiler_fn = functools.partial(unconfigured_compiler_fn, kwargs) > assert config.repro_after in ("dynamo", "aot", None) E torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised: E AssertionError: ``` because `config.repro_after` is being `'“aot”'` but not `'aot'`. --- It would've saved a few minutes of my time 😄 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114530 Approved by: https://github.com/Chillee	2023-11-25 23:15:47 +00:00
PyTorch MergeBot	2f3beb715c	Revert "Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926 )" This reverts commit `2ca1119d53`. Reverted https://github.com/pytorch/pytorch/pull/113926 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/113926#issuecomment-1822713852))	2023-11-22 12:52:33 +00:00
Thiago Crepaldi	3f736c2d77	Add ONNXProgram.__call__ API to run model with ONNX Runtime (#113495 ) Currently the user can use torch.onnx.dynamo_export to export the model. to ONNX. ```python import torch class Model(torch.nn.Module): def forward(self, x): return x + 1.0 onnx_program = torch.onnx.dynamo_export( Model(), torch.randn(1, 1, 2, dtype=torch.float), ) ``` The next step would be instantiating a ONNX runtime to execute it. ```python import onnxruntime # type: ignore[import] onnx_input = self.adapt_torch_inputs_to_onnx(args, *kwargs) options = options or {} providers = options.get("providers", onnxruntime.get_available_providers()) onnx_model = self.model_proto.SerializeToString() ort_session = onnxruntime.InferenceSession(onnx_model, providers=providers) def to_numpy(tensor): return ( tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy() ) onnxruntime_input = { k.name: to_numpy(v) for k, v in zip(ort_session.get_inputs(), onnx_input) } return ort_session.run(None, onnxruntime_input) ``` This PR provides the `ONNXProgram.__call__` method as facilitator to use ONNX Runtime under the hood, similar to how `torch.export.ExportedProgram.__call__` which allows the underlying `torch.fx.GraphModule` to be executed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113495 Approved by: https://github.com/titaiwangms	2023-11-22 01:48:45 +00:00
Antonio Kim	7fc292930c	Add support for `torch.Generator` type in TorchScript (#110413 ) - Add support for `torch.Generator` type in TorchScript - Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` - Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab) CC: @eellison @davidberard98 @GlebKazantaev @behzad-a Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98	2023-11-21 23:07:21 +00:00
HDCharles	18e1a37c4e	[ao] updating embedding_bag support for fx and eager (#107623 ) Summary: our docs were saying dynamic embedding bag wasn't supported but it actually is (at least at the same level as embeddings were) it just wasn't previously tested/listed. Test Plan: python test/test_quantization.py -k "test_embedding" Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/107623 Approved by: https://github.com/jerryzh168	2023-11-21 03:54:00 +00:00
Ke Wen	dc65f6c601	[c10d] Remove deprecated multi-gpu-per-thread APIs (#114156 ) As of today, PyTorch Distributed's preferred programming model is one device per thread, as exemplified by the APIs in its document. The multi-GPU functions (which stand for multiple GPUs per CPU thread) have been deprecated for three versions. Removing them now before 2.2 release. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114156 Approved by: https://github.com/albanD, https://github.com/fduwjj, https://github.com/H-Huang	2023-11-21 03:50:23 +00:00
voznesenskym	2ca1119d53	Add Stateful/Stateless symbolic contexts, use fresh fake mode for dynamo backends (#113926 ) The primary problem we are setting out to solve here is fake tensor freshness. Before this PR, fake tensors after dynamo represented fake tensors at the end of trace, so subsequent retraces like aot_autograd would start off with fake tensors in the wrong (end result) state, rather than their expected fresh state. The solution here is to start a fresh fake mode, and re-fakify the tensors. The nuance comes from ensuring that symbols are uniformly created for the symbolic sizes and strides of the tensor. This PR is the result of a lot of back and forth with @ezyang and @eellison. Initially, the first pass at this was not super different from what we have in the PR - the broad strokes were the same: 1) We cache source->symbol in shape_env 2) We pass policy objects around, stored at dynamo fakificaiton time, and reused for later fakification 3) We create a new fake mode for backends (from https://github.com/pytorch/pytorch/pull/113605/files) This is ugly, and has some layering violations. We detoured our decision making through a few other alternatives. Immutable/mutable fake tensor mode was the most interesting alternative, https://github.com/pytorch/pytorch/pull/113653, and was struck down on concerns of complexity in fake mode combined with it not covering all edge cases. We also detoured on what to do about tensor memoization returning back potentially different tensors than requested, and if that was an anti pattern (it is) we want to hack in with the symbol cache (we don't). We went back to the drawing board here, but with a few concessions: 1) the cache for source->symbol must live outside of shape_env, for both lifecycle, and layering reasons 2) A good amount of work needs to be done to pipe policy around fake_mode and meta_utils correctly, to cover all the cases (@ezyang did this) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113926 Approved by: https://github.com/ezyang, https://github.com/eellison	2023-11-20 23:06:37 +00:00
Edward Z. Yang	aeb5fd52c7	Remove dead tensor_has_hints. (#114071 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/114071 Approved by: https://github.com/aakhundov	2023-11-20 16:02:24 +00:00
Pearu Peterson	0bd4d1f4ab	Add sparse tensors support to dataloader. (#112842 ) Fixes https://github.com/pytorch/pytorch/issues/106837 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112842 Approved by: https://github.com/cpuhrsch, https://github.com/gokulavasan	2023-11-19 16:05:27 +00:00
Edward Z. Yang	e2b114ab9f	[BE] Package dynamic_dims/constraint_dims into CreateSymbolicPolicy (#113802 ) This will make it more convenient to propagate more information through all of these functions in the future (e.g., for storage offset information.) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113802 Approved by: https://github.com/davidberard98, https://github.com/voznesenskym	2023-11-17 18:22:46 +00:00
Edward Z. Yang	3a3a979984	Add torch.distributed.breakpoint (#113775 ) I tested it works by patching ``` diff --git a/test/distributed/test_dynamo_distributed.py b/test/distributed/test_dynamo_distributed.py index 96b3a82bdfa..dea9bac9302 100644 --- a/test/distributed/test_dynamo_distributed.py +++ b/test/distributed/test_dynamo_distributed.py @@ -18,6 +18,7 @@ from torch._dynamo import config from torch._dynamo.utils import same from torch._dynamo.testing import collect_results from torch.utils._triton import has_triton +import torch.distributed as dist from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy, lambda_auto_wrap_policy from torch._higher_order_ops.wrap import tag_activation_checkpoint from torch.nn.parallel import DistributedDataParallel as DDP @@ -398,6 +399,7 @@ class TestMultiProc(DynamoDistributedMultiProcTestCase): @unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch") def test_fsdp_activation_checkpointing(self): with _dynamo_dist_per_rank_init(self.rank, self.world_size): + dist.breakpoint() model, inputs = get_toy_model_for_activation_checkpointing(f"cuda:{self.rank}") is_inner = lambda module: isinstance(module, ToyInnerModel) # noqa: E731 wrap_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=is_inner) ``` and then running `python test/distributed/test_dynamo_distributed.py -k test_fsdp_activation_checkpointing` It prints: ``` ATTENTION!!! Type 'up' to get to the frame that called dist.breakpoint(rank=0) > /data/users/ezyang/c/pytorch/torch/distributed/__init__.py(71)breakpoint() -> barrier() (Pdb) up > /data/users/ezyang/c/pytorch/test/distributed/test_dynamo_distributed.py(402)test_fsdp_activation_checkpointing() -> dist.breakpoint() (Pdb) list 397 398 @skip_if_lt_x_gpu(1) 399 @unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch") 400 def test_fsdp_activation_checkpointing(self): 401 with _dynamo_dist_per_rank_init(self.rank, self.world_size): 402 -> dist.breakpoint() 403 model, inputs = get_toy_model_for_activation_checkpointing(f"cuda:{self.rank}") 404 is_inner = lambda module: isinstance(module, ToyInnerModel) # noqa: E731 405 wrap_policy = functools.partial(lambda_auto_wrap_policy, lambda_fn=is_inner) 406 model = apply_fsdp_with_checkpointing(model, wrap_policy, is_inner) 407 correct_outputs = model(inputs) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113775 Approved by: https://github.com/wconstab, https://github.com/wanchaol	2023-11-16 19:30:57 +00:00
Mu-Chu Lee	eddce3c054	[AOTInductor] Rename model_runner to model_container_runner (#111324 ) Summary: We rename the model_runner to model_container_runner to prepare for adding tests of pure model without container. Test Plan: commit itself is a test. Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/111324 Approved by: https://github.com/desertfire, https://github.com/chenyang78	2023-11-16 19:14:22 +00:00
Tongzhou Wang	275403be16	[doc] Add nn.parametrizations.weight_norm (#113783 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113783 Approved by: https://github.com/albanD	2023-11-16 17:42:48 +00:00
Iris Zhang	72ce5dd13e	[2D] Remove enable_2d_with_fsdp() API and make remove_enable_2d_with_fsdp private (#112473 ) As we have our new 2D flow out, we want to remove `enable_2d_with_fsdp()`. In addition, we change pre_dp_module_transform to private, as we may need to change the UX later on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112473 Approved by: https://github.com/fegin, https://github.com/wanchaol	2023-11-16 01:14:00 +00:00
gs-olive	757f36b988	[docs] Fix `torch.compile` "tensorrt" backend docs (#113711 ) - Update description from ONNX to current state (Torch-TensorRT) - Add clarification about import Fixes documentation on this page: https://pytorch.org/docs/stable/torch.compiler.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/113711 Approved by: https://github.com/msaroufim	2023-11-15 08:42:53 +00:00
drisspg	9b0f2f8d94	expose sdpa helpers to python (#110496 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110496 Approved by: https://github.com/jbschlosser	2023-11-15 07:34:34 +00:00
PyTorch MergeBot	252e68a83b	Revert "Add support for `torch.Generator` type in TorchScript (#110413 )" This reverts commit `54493fe8c4`. Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is, unfortunately, still breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1811625557))	2023-11-15 00:51:23 +00:00
Antonio Kim	54493fe8c4	Add support for `torch.Generator` type in TorchScript (#110413 ) - Add support for `torch.Generator` type in TorchScript - Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` - Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab) CC: @eellison @davidberard98 @GlebKazantaev @behzad-a Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98	2023-11-13 23:18:14 +00:00
Justin Chu	47a59ee4d1	[ONNX] Update exporter issue report instructions for quantized models (#113494 ) Update the instructions to point users to the right place for creating issues. https://github.com/onnx/onnx/issues/5674#issuecomment-1806505240 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113494 Approved by: https://github.com/jerryzh168	2023-11-13 18:18:19 +00:00
Bin Bao	c197c48ceb	[aotinductor] Add a demo tutorial (#112457 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112457 Approved by: https://github.com/msaroufim, https://github.com/albanD	2023-11-10 17:01:03 +00:00
Thiago Crepaldi	574e313643	Add thiagocrepaldi as person of interest for onnx exporter (#113402 ) @malfet @kit1980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113402 Approved by: https://github.com/malfet	2023-11-10 15:19:58 +00:00
Sergii Dymchenko	bb06725ee0	Update mentions of deprecated functions if complex_numbers.rst (#113391 ) `torch.svd` is deprecated, and `torch.solve` is completely removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113391 Approved by: https://github.com/malfet, https://github.com/lezcano	2023-11-09 22:32:26 +00:00
Jerry Zhang	501d118255	[quant][pt2e] Add transform_for_annotation method in Quantizer (#113115 ) Summary: Adding the method so that people can do some transformations before annotation to make the graph easier to annotate Test Plan: python test/test_quantization.py TestQuantizePT2E.test_transform_for_annotation Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D51141080](https://our.internmc.facebook.com/intern/diff/D51141080) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113115 Approved by: https://github.com/kimishpatel	2023-11-09 20:23:29 +00:00
Nikita Shulga	81bf0bd68d	[no ci] Fix typo in `persons_of_interest.rst` (#113283 ) There are no `c` in `Hirsh` Pull Request resolved: https://github.com/pytorch/pytorch/pull/113283 Approved by: https://github.com/bdhirsh	2023-11-08 19:36:32 +00:00
Edward Z. Yang	1f3fa13f0a	Handle unbacked SymInt sized outputs in AOTAutograd (#113159 ) Thanks aakhundov for constructing the test case. This PR was constructed by running the failing test case, and then fixing problems until we got all the way to the end. There are a few distinct fixes: * AOTAutograd performs equality tests on tensor metadata to determine if a metadata mutation had occurred. If we test i0 vs i1, we should report these are NOT equal, since obviously we have somehow resized the tensor from i0 to i1 (even if, on a particular run, it is possible i0 == i1). * There's a sketchy fix for `test_aot_autograd_exhaustive_matmul_cpu_float32` where we check if the output shape equals the tangent shape. Unfortunately, the same `definitely_true` treatment does not work here, it still fails on the example. I piled an extra sketchy fix on top of it, where I just try my best to avoid doing the view. Maybe we should have some sort of logging here. * Partitioner needs to get out a size for unbacked SymInt when partitioning. I just feed it a random heuristic value in this case, similar to how we've been dealing with this in Inductor. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113159 Approved by: https://github.com/aakhundov, https://github.com/bdhirsh	2023-11-08 04:28:38 +00:00
PyTorch MergeBot	9a28a7b498	Revert "Add support for `torch.Generator` type in TorchScript (#110413 )" This reverts commit `27e31ab6e8`. Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1799003164))	2023-11-07 15:53:32 +00:00
Thiago Crepaldi	eefe327b11	Rename torch.onnx.ExportOutput* to ONNXProgram* (#112263 ) Since PyTorch 2.1, torch.export API was introduced and the term "export" got overloaded due to the already existing torch.onnx.export API. The torch.onnx.dynamo_export API was introduced on pyTorch 2.0 and it exposed a torch.onnx.ExportOutput which now can be confused with torch.export.export output To prevent such ambiguity and standardize names around the new torch.export.ExportedProgram, this PR renames torch.onnx.ExportOutput to torch.onnx.ONNXProgram Pull Request resolved: https://github.com/pytorch/pytorch/pull/112263 Approved by: https://github.com/BowenBao ghstack dependencies: #112444	2023-11-06 22:27:15 +00:00
Antonio Kim	27e31ab6e8	Add support for `torch.Generator` type in TorchScript (#110413 ) - Add support for `torch.Generator` type in TorchScript - Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` - Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab) CC: @eellison @davidberard98 @GlebKazantaev @behzad-a Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98	2023-11-06 21:27:02 +00:00
Peter Bell	718035791d	Prefer `e.is_number` over `not e.free_symbols` in SymPy (#112688 ) We spend somewhere on the order 1% in `sympy.Expr.free_symbols` as it is called millions of times. Most of the time we actually just want to know "is this a constant", however `e.is_constant()` is horribly slow. It turns out though that there is another propery `is_number` that does what we want. > property is_number: > > Returns True if self has no free symbols and no undefined functions (AppliedUndef, to be precise). It will be faster > than if not self.free_symbols, however, since is_number will fail as soon as it hits a free symbol or undefined > function. Even further, we also avoid the overhead of building the unnecessary set object. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112688 Approved by: https://github.com/lezcano	2023-11-06 20:05:13 +00:00
Chien-Chin Huang	9d0c3e21d0	[state_dict][9/N] Add get and set APIs for model and optimizer state_dict (#112203 ) The original get_state_dict and set_state_dict pair is too complicated because of the possible combinations of usages. This PR adds the APIs to get/set model_state_dict and optimizer_state_dict seperately. Differential Revision: [D50713584](https://our.internmc.facebook.com/intern/diff/D50713584/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/112203 Approved by: https://github.com/wz337 ghstack dependencies: #112167	2023-11-02 22:03:57 +00:00
Zhengxu Chen	50767a075a	[export] Clean up verifier [1/n]. (#112505 ) Summary: Some adjustments to verifier so that it's easier to use it correctly. We will enable verifier later, so the current diff is no-op. Test Plan: CI Differential Revision: D50839295 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112505 Approved by: https://github.com/tugsbayasgalan, https://github.com/angelayi	2023-11-02 19:36:06 +00:00
Jerry Zhang	6929ebf2b0	[quant][docs] Add x86 inductor quant docs (#112648 ) Summary: att Test Plan: . Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/112648 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/andrewor14	2023-11-02 17:02:09 +00:00
Edward Z. Yang	09df6b771b	Add a note about performant record_stream use. (#112526 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/112526 Approved by: https://github.com/albanD	2023-11-02 15:50:22 +00:00

... 5 6 7 8 9 ...

2939 Commits