pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Tongzhou Wang	faab564bda	[doc] Fix grammar in export.ir_spec.rst (#139584 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139584 Approved by: https://github.com/zou3519	2024-11-05 23:26:36 +00:00
Ryan Guo	693a0a1bd4	[dynamo][NFC] Rename `mutable_local` and add documentation (#139339 ) This patch addresses the renaming part of #133027, specifically, it renames the following and adds documentation for relevant classes. 1. `VariableTracker.mutable_local` to `mutation_type` 2. `MatableLocal `to `ValueMutationNew` 3. `MutableSideEffects `to `ValueMutationExisting` 4. `MutableLocalSource` to `SourceType` 5. `MutableLocalSource.Local` to `New` Note that (2), (3) and (5) are mainly to bring consistency between them and `AttributeMutationNew`, `AttributeMutationExisting`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139339 Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305	2024-11-05 19:11:41 +00:00
Henry Tsang	350bc2a166	[export] Add support for symbool to make it usable for torch.cond (#138765 ) # Why? I want the following code to work. minimal repro: ``` class M(torch.nn.Module): def forward(self, dilate_flag): return dilate_flag.item() input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),) model = M().cuda() ep = torch.export.export(model, input1, strict=True) path = torch._inductor.aot_compile(ep.module(), input1) aot_model = torch._export.aot_load(path, device="cuda") actual_output = aot_model(input1) ``` error: AssertionError: Encountered an unsupported object of type <class 'torch.SymBool'> while writing the metadata for exported program second error will be handled by https://github.com/pytorch/pytorch/pull/138760 # Motivation I could technically bypass it with a torch.int tensor. However, it doesn't work with torch.cond. I want the following to work. It would also require https://github.com/pytorch/pytorch/pull/138760 for aot compile to work. ``` class M(torch.nn.Module): def __init__(self) -> None: super().__init__() self.dilate_flag = 0 def forward(self, dilate_flag): self.dilate_flag = dilate_flag.item() def true_fn(dilate_flag): return dilate_flag.clone() def false_fn(dilate_flag): return dilate_flag.clone() torch.cond( self.dilate_flag, true_fn, false_fn, (dilate_flag,), ) return self.dilate_flag input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),) input2 = (torch.tensor([0], dtype=torch.bool, device="cuda"),) inputs = (input1, input2) model = M().cuda() for input in inputs: expected_output = model(input) ep = torch.export.export(model, input, strict=False) path = torch._inductor.aot_compile(ep.module(), input) aot_model = torch._export.aot_load(path, device="cuda") actual_output = aot_model(*input) assert ( expected_output == actual_output ), f"henry they are not equal {expected_output} != {actual_output}" ``` Differential Revision: D64867504 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138765 Approved by: https://github.com/ydwu4	2024-11-04 23:31:49 +00:00
Jane Xu	514c466cd9	Redirect the custom ops landing page :D (#139634 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139634 Approved by: https://github.com/zou3519	2024-11-04 22:25:15 +00:00
Will Constable	3d93caf664	[c10d] Add thread-safety initialization warning (#139638 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139638 Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/XilunWu	2024-11-04 21:38:47 +00:00
Edward Z. Yang	585dbfa583	Profile guided optimization for automatic_dynamic (#139001 ) Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR. This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001 Approved by: https://github.com/oulgen	2024-11-03 06:29:57 +00:00
PyTorch MergeBot	92d7f29e59	Revert "Profile guided optimization for automatic_dynamic (#139001 )" This reverts commit `f6be44c74e`. Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to more fbcode errors ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452985581))	2024-11-02 13:11:04 +00:00
Edward Z. Yang	f6be44c74e	Profile guided optimization for automatic_dynamic (#139001 ) Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR. This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001 Approved by: https://github.com/oulgen	2024-11-02 11:50:11 +00:00
PyTorch MergeBot	8d1eaa3da6	Revert "Profile guided optimization for automatic_dynamic (#139001 )" This reverts commit `a6630bcf87`. Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to internal code triggers import cycle ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452833882))	2024-11-02 03:38:15 +00:00
Mikayla Gawarecki	a979318ef7	Add section to serialization note re weights_only (#139433 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139433 Approved by: https://github.com/malfet ghstack dependencies: #138936, #139221	2024-11-01 21:51:50 +00:00
Edward Z. Yang	a6630bcf87	Profile guided optimization for automatic_dynamic (#139001 ) Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR. This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001 Approved by: https://github.com/oulgen	2024-11-01 21:43:25 +00:00
Mikayla Gawarecki	ea0e09b3f3	Add utility to get all unsafe globals in checkpoint (no pickletools dependency) (#139221 ) Fixes https://github.com/pytorch/pytorch/issues/129698 https://github.com/pytorch/pytorch/pull/139106 without pickletools Pull Request resolved: https://github.com/pytorch/pytorch/pull/139221 Approved by: https://github.com/malfet ghstack dependencies: #138936	2024-11-01 19:31:39 +00:00
bskrlj	8e27833e30	Ensure SWA boundary conditions w.r.t. definition (#133773 ) According to the documentation, decay is a number in [0,1] range,[ i.e.](https://pytorch.org/docs/stable/optim.html) ``` Decay is a parameter between 0 and 1 that controls how fast the averaged parameters are decayed. If not provided to get_ema_multi_avg_fn, the default is 0.999. ``` An inspection of `swa_utils.py` indicates there are no checks for invalid values of `decay`. Adding asserts as suggested in this PR ensures valid compute range (one way to enforce correct behavior, there are perhaps more suitable ones). Papers `torch` cites for reference idea/implementation also consider exclusively this range (e.g., https://arxiv.org/pdf/2310.04415). Fixes https://github.com/pytorch/pytorch/issues/133772 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133773 Approved by: https://github.com/janeyx99	2024-10-31 18:24:08 +00:00
Nhat Minh Luu	261d90c18f	Add docs page for `torch.inf` and `torch.nan` (#138430 ) Fixes #131040 ## Description Add docs for `torch.inf` and `torch.nan`, ## Checklist - [x] The issue that is being fixed is referred in the description (see above "Fixes #ISSUE_NUMBER") - [x] Only one issue is addressed in this pull request - [x] Labels from the issue that this PR is fixing are added to this pull request - [x] No unnecessary issues are included into this pull request. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138430 Approved by: https://github.com/ezyang	2024-10-31 05:46:46 +00:00
Boyuan Feng	68134a320e	[Flex Attention] Paged Attention (#137164 ) This PR adds paged attention for flex attention. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137164 Approved by: https://github.com/drisspg	2024-10-29 17:05:22 +00:00
Jeff Daily	7c7b2d89ba	[ROCm] set hipblas workspace (#138791 ) Fixes #138532. This brings hipblas behavior in line with cublas behavior with respect to setting the workspace to an allocation from the caching allocator as well as the env var HIPBLAS_WORKSPACE_CONFIG. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138791 Approved by: https://github.com/naromero77amd, https://github.com/eqy, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-10-29 01:37:55 +00:00
Svetlana Karslioglu	e00ead400c	Add a temporary Survey about the search (#139096 ) - Add a link to the new search survey - Add .css classes needed for the search banner Pull Request resolved: https://github.com/pytorch/pytorch/pull/139096 Approved by: https://github.com/seemethere, https://github.com/cjyabraham	2024-10-28 23:43:25 +00:00
Joel Schlosser	8ba9063002	FlexAttention support for NJT (#136792 ) This PR adds FlexAttention + NJT support. In particular: * To handle raggedness, treats the packed sequence dim of input NJTs as a giant "stacked sequence". To ensure user `score_mod` / `mask_mod` functions can still be written in the original NJT sequence space, this PR handles conversions for indices within the giant "stacked sequence" -> sequence relative indices automatically. * Provides `py_impls` for `NestedTensor` to the HOPs for flex attention forward / backward that simply wrap / unwrap NJTs appropriately * Adds barebones `new_empty()` support to NJT since FlexAttention utilizes this repeatedly; right now, only `new_empty()` with a shape of `()` is supported * Tests that FlexAttention with a causal mask matches causal SDPA * Adds a new public API for FlexAttention usage: * `create_nested_block_mask(mask_mod, B, H, njt, BLOCK_SIZE, _compile)` - NJT analogue for `create_block_mask()` that utilizes the `njt`'s ragged structure to create an appropriately-sized block mask (e.g. `(1, 1, total_seqlen, total_seqlen)`). This function handles the index conversion from "stacked sequence" space -> relative sequence space. * Minor note: as this is a public API, this function is purposefully named with "nested" instead of "njt" to keep the latter as an informal, mostly internal-only term. Example usage: ```python def causal_mask(b, h, q_idx, kv_idx): return q_idx >= kv_idx query = ... # NJT of shape (B, H, S, D) key = ... # NJT of shape (B, H, S, D) value = ... # NJT of shape (B, H, S, D) # create_nested_block_mask() automatically converts indices from "stacked sequence" space -> relative sequence space block_mask = create_nested_block_mask(causal_mask, 1, 1, query) # block mask conceptual shape is (B, H, sum(S), sum(S)) output = flex_attention(query, key, value, block_mask=block_mask) def causal_score_mod(score, b, h, q_idx, kv_idx): return torch.where(q_idx >= kv_idx, score, float("-inf")) # flex_attention() automatically converts indices from "stacked sequence" space -> relative sequence space for NJT inputs output2 = flex_attention(query, key, value, score_mod=causal_score_mod) ``` TODO: ~~Determine the right level of abstraction for public API helpers + move them alongside other helpers~~ Verify this with others though * ~~Some cleanup~~ * ~~`njt_score_mod_adapter`~~ * ~~Q: should `create_njt_block_mask()` call `njt_mask_mod_adapter()` so we don't need two calls?~~ * Can we avoid materializing the `sum(s)` length `seq_idx` used for conversion between stacked sequence -> sequence relative indices? * Not for now, although future work may deepen the integration between Flex + NJT (possibly requiring custom templates). We should try to cache this though. * ~~Demonstrate non-causal mask~~ * Support non-contiguous NJTs with holes (booted to future PR) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136792 Approved by: https://github.com/drisspg ghstack dependencies: #138841	2024-10-28 20:01:27 +00:00
Wouter Devriendt	bae3426af7	reimport pr137735 due to merging check issues (#138959 ) This is a cherry-pick from #137735 by @mikaylagawarecki , that cannot be merged due to a (wrongly) failing check for codev @diff-train-skip-merge Pull Request resolved: https://github.com/pytorch/pytorch/pull/138959 Approved by: https://github.com/mikaylagawarecki	2024-10-27 16:31:34 +00:00
Yu, Guangye	40c098f731	Introduce a device-agnostic runtime API design (#132204 ) # Motivation According to [[RFC]A device-agnostic Python runtime API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/128403), this PR intends to introduce a device-agnostic runtime API design. I personally prefer the Simple Version APIs that no longer accept the device type as an input argument. It means we will leverage `getAccelerator` to fetch the current accelerator. And it is flexible to expand these APIs to handle multiple types of accelerator scenarios. The design does NOT break the previous design philosophies. I also believe that namespace torch.accelerator is better. It lets users know that the APIs they are calling are running on an accelerator rather than CPU. This is important. Meanwhile, we can follow a simple API design principle: 1. Device-agnostic APIs should be placed under the torch.accelerator namespace and not accept a device_type optional parameter. 2. Device-specific APIs should be placed under device-specific submodules. 3. APIS required by both CPU and accelerators should be placed under the torch namespace and accept a device_type optional parameter. Also, I list the pros and cons of Simple Version here: Pros: - `torch.accelerator.foo` will have the same input argument as `torch.xxx.foo`, bringing a better user experience; - more concise, facilitate the developer to write a device-agnostic code. Cons: - no obvious drawbacks. # Additional Context I list the new APIs here: ```python torch.accelerator.is_available() -> bool: torch.accelerator.current_accelerator() -> torch.device: torch.accelerator.device_count() -> int: torch.accelerator.current_device_idx() -> int: torch.accelerator.set_device_idx(device: Union[torch.device, str, int, None]) -> None: torch.accelerator.current_stream(device: Union[torch.device, str, int, None]) -> torch.Stream: torch.accelerator.set_stream(stream: torch.Stream) -> None: torch.accelerator.synchronize(device: Union[torch.device, str, int, None]) -> None: ``` According to the discussion with Alban, we decide to change the API name `set_device` to `set_device_idx` and `current_device` to `current_device_idx` for more explicit. And will submit other PR to support device and stream context manager. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132204 Approved by: https://github.com/EikanWang, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/albanD	2024-10-27 10:37:09 +00:00
Laith Sakka	ed313a5ca2	Introduce torch.sym_add, variadic add (#138660 ) Tested internally here: https://www.internalfb.com/diff/D64057744 This is a reland after previous internal failures. main change is ``` if min is None and max is None: torch._check_is_size(size) return ``` Partially addresses https://github.com/pytorch/pytorch/issues/128150 When you have big sums of values, we end up computing long chains of binary addition in our FX graph representation. Not only is this ugly, it also is quadratic, as the sympy.Add constructor is O(N) in number of arguments. Instead, ensure that we maintain the summation as a single FX node so we can do the entire addition all in one go. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138660 Approved by: https://github.com/ezyang, https://github.com/bobrenjc93	2024-10-23 17:42:41 +00:00
Laith Sakka	662d07e93e	Remove parallel_and and parallel_or (#138135 ) Not used, suggested by @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/138135 Approved by: https://github.com/ezyang	2024-10-23 00:22:22 +00:00
Nikita Shulga	d1be61ce4e	Update copyrights to 2024 (#138638 ) Spiritual successor of https://github.com/pytorch/pytorch/pull/119413 + CPP docs copyright update as well Fixes https://github.com/pytorch/pytorch/issues/138630 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138638 Approved by: https://github.com/atalman	2024-10-22 21:00:58 +00:00
Syed Tousif Ahmed	03c72976a5	Properly uses ref-counting for torch.cuda.use_mem_pool (#133600 ) This PR refactors some ref-counting functionality out of `beginAllocateToPool` and `releasePool`. The ref-counting logic is then used in construction and destruction of `torch.cuda.MemPool`. The `use_count` variable in the CUDACachingAllocator is essentially a refcount of how many context managers are using the pool. Since we are now lifting up the MemPool abstraction to the user, the MemPool object itself now needs to hold a an extra reference as well. Part of https://github.com/pytorch/pytorch/issues/124807. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133600 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-10-22 03:21:53 +00:00
Mikayla Gawarecki	e24871eb3c	Add environment variable to force no weights_only load (#138225 ) In preparation for `weights_only` flip, if users don't have access to the `torch.load` call Pull Request resolved: https://github.com/pytorch/pytorch/pull/138225 Approved by: https://github.com/albanD	2024-10-21 23:26:15 +00:00
Justin Chu	c6609ece84	[ONNX] Remove deprecated export_to_pretty_string (#137790 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137790 Approved by: https://github.com/titaiwangms, https://github.com/xadupre ghstack dependencies: #137789	2024-10-21 18:17:48 +00:00
Tugsbayasgalan Manlaibaatar	1f32a1fb80	Replace torch.export default decomp table to be lazily populated (#137650 ) In this PR, we implement lazy dictionary for export decomp behaviour for following reasons: 1. Custom op loading can happen after import time, as a result, the decomp table might not be able to pick up the decomp. Therefore we try to delay materialization as late as possible. I intentionally seperated out the core_aten_decomp to not have any custom CIA ops in this PR to mitigate the risk of getting reverted but in the future, core_aten_decomp under torch/_decomp will exist as an alias to official export table (torch.export.default_decompositions) Differential Revision: [D64140807](https://our.internmc.facebook.com/intern/diff/D64140807) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137650 Approved by: https://github.com/justinchuby, https://github.com/bdhirsh	2024-10-18 19:28:52 +00:00
Svetlana Karslioglu	9c2a80322a	Add Programmable Google Search (#137716 ) - Adding the code for the programmable Google search - Adding the CSS overrides. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137716 Approved by: https://github.com/seemethere, https://github.com/albanD Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2024-10-18 18:18:16 +00:00
ErezYosef	5a81475884	Documentation Update: Fix Missing Whitespace in Optimizer Docs (#138321 ) ### Description: This PR addresses a minor [formatting issue identified in a previous contribution to the Optimizer documentation](https://github.com/pytorch/pytorch/pull/134107#discussion_r1800833948). Specifically, it fixes the missing whitespace after `param_names` in the section on utilizing named parameters to load the optimizer state dict. You can find the related docs here: [Optimizer Documentation](https://pytorch.org/docs/main/optim.html#how-to-utilize-named-parameters-to-load-optimizer-state-dict). @janeyx99 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138321 Approved by: https://github.com/janeyx99	2024-10-18 15:41:43 +00:00
Yu, Guangye	8cda774a03	Add torch.xpu.get_arch_list and torch.xpu.get_gencode_flags for XPU (#137773 ) # Motivation Add `torch.xpu.get_arch_list()` and `torch.xpu.get_gencode_flags()` methods that return architecture list and AOT flags to preserve what flags PyTorch XPU was built with. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137773 Approved by: https://github.com/EikanWang, https://github.com/albanD	2024-10-18 02:28:08 +00:00
Zheng, Zhaoqiong	7ba706c74e	update get start xpu (#137479 ) 1. respect the comment from the community, downgrade the "Beta" to "Prototype" for the first xpu release with wheel 2. add wheels installation of torchaudio & torchvision for nightly on Windows Pull Request resolved: https://github.com/pytorch/pytorch/pull/137479 Approved by: https://github.com/atalman, https://github.com/malfet	2024-10-16 17:36:29 +00:00
PyTorch MergeBot	dd32a32cb6	Revert "Expose option to disable CRC-32 computation during `torch.save` (#137735 )" This reverts commit `534fa96f2d`. Reverted https://github.com/pytorch/pytorch/pull/137735 on behalf of https://github.com/clee2000 due to failing internally D64438525, probably needs gating ([comment](https://github.com/pytorch/pytorch/pull/137735#issuecomment-2417412264))	2024-10-16 17:03:06 +00:00
William Wen	4c8718d8e7	[dynamo] add torch.compiler.set_stance (#137504 ) Attempt # 2 at https://github.com/pytorch/pytorch/pull/132926 to implement https://github.com/pytorch/pytorch/issues/123771. Implement a new `torch.compiler.set_stance` function that can force `torch.compile` regions to run eagerly. See added tests for usage examples. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137504 Approved by: https://github.com/yf225, https://github.com/jansel	2024-10-16 16:18:25 +00:00
Howard Huang	75109682b6	[Pipelining] Refactor Interleaved1F1B and ZeroBubble (#137783 ) NOTE: this PR removes `ScheduleFlexibleInterleaved1F1B`, let me know if theres any concerns. `ScheduleFlexibleInterleaved1F1B` is a superset of `Interleaved1F1B` and uses most of the same implementation, but relaxes the condition that `n_microbatches % pp_size == 0`. This is refactors the implementation into `Interleaved1F1B` and then removes it since it is confusing to have both schedules with similar names. This also refactors the zero bubble logic to belong in the `ZeroBubble` schedule class. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137783 Approved by: https://github.com/wconstab	2024-10-16 03:05:14 +00:00
Jane Xu	eaec72d1e6	Link directly to new Custom Ops Landing Page (#137933 ) e.g., click on first link in https://docs-preview.pytorch.org/pytorch/pytorch/137933/library.html#testing-custom-ops Pull Request resolved: https://github.com/pytorch/pytorch/pull/137933 Approved by: https://github.com/zou3519	2024-10-15 21:18:21 +00:00
Mikayla Gawarecki	534fa96f2d	Expose option to disable CRC-32 computation during `torch.save` (#137735 ) Option only works in open source, not internal Pull Request resolved: https://github.com/pytorch/pytorch/pull/137735 Approved by: https://github.com/albanD	2024-10-15 19:30:02 +00:00
PyTorch MergeBot	2831af39c4	Revert "[ONNX] Remove deprecated export_to_pretty_string (#137790 )" This reverts commit `d0628a7e39`. Reverted https://github.com/pytorch/pytorch/pull/137790 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/137789#issuecomment-2414632100))	2024-10-15 17:40:06 +00:00
Alex Baden	39d21ed803	[Inductor] Update AttrsDescriptor instantiation for Triton changes (#137458 ) The `AttrsDescriptor` class has been present in Triton for almost a year now (introduced [here](`72c9833927`)), so we should be able to rely on it existing. I am in the process of supporting the new `AttrsDescriptor` class and @jansel suggested I split changes to the existing class out separately to make sure nothing breaks removing the legacy attribute descriptor attributes. Initially I attempted to remove the branching around detecting whether `AttrsDescriptor` exists but that breaks because PyTorch must build without Triton. So, I went back and updated for the naming introduced in the commit linked above, and also removed two unused attributes `divisible_by_8` and `ids_to_fold` which were removed in Feb 2024 (https://github.com/triton-lang/triton/pull/3122 and https://github.com/triton-lang/triton/pull/3080 respectively). With these changes only the internal workings of the `AttrsDescriptor` class will differ between supported Triton versions, but the data stored will remain consistent. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137458 Approved by: https://github.com/jansel	2024-10-14 20:20:29 +00:00
ErezYosef	197601eeea	Add Support for Tracking Parameter Names (named_parameters) in Optimizer State Dict (#134107 ) A proposal addressing Issue #1489: Optimizer should track parameter names and not id. (also mentioned in here: [[RFC] Introducing FQNs/clarity eyeglasses to optim state_dict](https://dev-discuss.pytorch.org/t/rfc-introducing-fqns-clarity-to-optim-state-dict/1552) ## Summary This PR introduces a backward-compatible enhancement where optimizers track parameter names instead of just their id. Optimizers can be initialized with `named_parameters()` as: ```python optimizer = optim.SGD(model.named_parameters(), lr=0.01, momentum=0.9) ``` This allows for greater clarity and ease when handling optimizers, as the parameters' names are preserved within the optimizer’s `state_dict` as: ``` state_dict = { 'state': { 0: {'momentum_buffer': tensor(...), ...}, 1: {'momentum_buffer': tensor(...), ...}, }, 'param_groups': [ { 'lr': 0.01, 'weight_decay': 0, ... 'params': [0,1] 'param_names' ['layer.weight', 'layer.bias'] (optional) } ] } ``` Loading `state_dict` is not changed (backward-compatible) and the `param_names` key will be ignored. ## Key Features #### Named Parameters in Optimizer Initialization: Optimizers can accept the output of `model.named_parameters()` during initialization, allowing them to store parameter names directly. #### Parameter Names in `state_dict`: The parameter names are saved as a list in the optimizer’s `state_dict` with key `param_names`, alongside the `params` indices, ensuring seamless tracking of both names and parameters. ## Backward Compatibility #### No Breaking Changes: This change is fully backward-compatible. The added `param_names` key in the optimizer's `state_dict` is ignored when loading a state to the optimizer. #### Customization with Hooks: For more control, the loaded state_dict can be modified using a custom `register_load_state_dict_pre_hook`, providing flexibility for different design needs. ## Documentation Updates Please refer to the documentation changes for more details on how this feature is implemented and how it can be used effectively. ## Solution Example: A suggested solution to the problem mentioned in #1489, for the same parameters but in a different order. The following `register_load_state_dict_pre_hook` should be added to the optimizer before loading to enable loading the state dict : ```python def adapt_state_dict_ids(optimizer, state_dict): # assuming a single param group. current_state_group = optimizer.state_dict()['param_groups'][0] loaded_state_group = state_dict['param_groups'][0] # same number of params, same names, only different ordering current_state_name_to_id_mapping = {} # mapping -- param_name: id for i, name in enumerate(current_state_group['param_names']): current_state_name_to_id_mapping[name] = current_state_group['params'][i] # changing the ids of the loaded state dict to match the order of the given state dict. for i, name in enumerate(current_state_group['param_names']): loaded_state_group['params'][i] = current_state_name_to_id_mapping[name] return state_dict ``` In this code, the loaded `state_dict` ids are adapted to match the order of the current optimizer `state_dict`. Both the previous and the current optimizers are required to be initiated with `named_parameters()` to have the 'param_names' key in the dict. ### Note This is my first contribution to PyTorch, and I wish to receive feedback or suggestions for improvement. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134107 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2024-10-14 19:24:44 +00:00
Justin Chu	d0628a7e39	[ONNX] Remove deprecated export_to_pretty_string (#137790 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137790 Approved by: https://github.com/titaiwangms ghstack dependencies: #137789	2024-10-11 20:10:04 +00:00
Jiong Gong	e30c55ee52	Update maintainers for inductor and x86 CPU (#136839 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136839 Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/malfet	2024-10-11 07:24:07 +00:00
Jin Zhou	5516ac5c21	[ROCm] Tunableop record untuned (#128813 ) When enable tunableop, It is easy to have OOM since APP usually needs large video memory size, such as running a LLM for inference. So we need a offline mode to tune the GEMMs. This PR provide an offline mode for tunableOp: - record untuned GEMMs to file. - a python API named tune_gemm_in_file is added to read the untuned file and tune the GEMMs in file Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813 Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/naromero77amd Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-10-09 21:59:03 +00:00
Jane Xu	cfe970260a	Clarify opt-einsum usage, fix #127109 (#137596 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137596 Approved by: https://github.com/albanD	2024-10-09 20:31:24 +00:00
PyTorch MergeBot	16a2c2cfd4	Revert "Introduce torch.sym_sum (#136429 )" This reverts commit `90bed32b98`. Reverted https://github.com/pytorch/pytorch/pull/136429 on behalf of https://github.com/ezyang due to fails internal stuff ([comment](https://github.com/pytorch/pytorch/pull/136429#issuecomment-2403335147))	2024-10-09 20:08:01 +00:00
Edward Z. Yang	90bed32b98	Introduce torch.sym_sum (#136429 ) Partially addresses https://github.com/pytorch/pytorch/issues/128150 When you have big sums of values, we end up computing long chains of binary addition in our FX graph representation. Not only is this ugly, it also is quadratic, as the sympy.Add constructor is O(N) in number of arguments. Instead, ensure that we maintain the summation as a single FX node so we can do the entire addition all in one go. update_hint_regression benchmark, before and after: ``` update_hint_regression,compile_time_instruction_count,2648328980 update_hint_regression,compile_time_instruction_count,2563748678 ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429 Approved by: https://github.com/isuruf	2024-10-08 18:12:57 +00:00
Michael Lazos	22e19bd2d7	Add link to torch.compile the missing manual in troubleshooting (#137301 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/137301 Approved by: https://github.com/svekars Co-authored-by: Svetlana Karslioglu <svekars@meta.com>	2024-10-04 18:19:30 +00:00
Jeff Daily	c7b0d4b148	raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114 ) raw_alloc is used by cudnn, miopen, thrust, and tunableop. Without this PR, the env var for disabling the caching allocator will only partially work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114 Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD Co-authored-by: Nichols A. Romero <nick.romero@amd.com>	2024-10-04 15:36:29 +00:00
PyTorch MergeBot	0d1701f310	Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114 )" This reverts commit `7001907480`. Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007))	2024-10-03 06:22:55 +00:00
Xilun Wu	54f50f19eb	[dtensor][experimental] expose DTensor Context Parallel API (#137038 ) Summary expose experimental Context Parallel API `torch.distributed.tensor.experimental._attention.context_parallel` to module `torch.distributed.tensor.experimental`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137038 Approved by: https://github.com/wz337, https://github.com/fegin	2024-10-02 18:00:23 +00:00
Jeff Daily	7001907480	raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114 ) raw_alloc is used by cudnn, miopen, thrust, and tunableop. Without this PR, the env var for disabling the caching allocator will only partially work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114 Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD Co-authored-by: Nichols A. Romero <nick.romero@amd.com>	2024-10-02 16:27:15 +00:00
Nikita Shulga	76a57568de	Update windows maintainers (#136901 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136901 Approved by: https://github.com/albanD	2024-09-30 16:12:49 +00:00
albanD	2421344d8f	Update current maintainers (#136672 ) This file didn't had an overall in a few years so long overdue. Most of the credit goes to @orionr for gathering all of this info. The main rules we followed: - No code contributor is removed, they're all placed as emeritus - Breakdown too big categories to make this document useful to know who to ping - No category where the code is still in the codebase is removed - We did not rework the categories (for example to be closer to module: labels) and leave that for later - All non-emeritus names are ordered by their number of comments on issues related to their topic Pull Request resolved: https://github.com/pytorch/pytorch/pull/136672 Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/seemethere, https://github.com/malfet	2024-09-26 17:13:16 +00:00
Zheng, Zhaoqiong	f3dd1721f4	[Update] Update note for Getting Started with PyTorch on Intel GPUs (#129946 ) remove the hardware and software prerequisites and set up env part. keep the prerequisites section and link to pytorch prerequistes for intel gpus for driver install, intel support package install and env set up https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html Update the support for Intel Client GPU MTL-H Update inference & training examples Pull Request resolved: https://github.com/pytorch/pytorch/pull/129946 Approved by: https://github.com/seemethere	2024-09-26 00:22:05 +00:00
Jokeren	cabfbef6cf	[pytorch][PR] [inductor] More fixes on the keys of `constants` and `signature` dictionaries (#136514 ) Summary: Previous PR forgets to change two other places that also create `constants` and `signature`. Test Plan: Imported from GitHub, without a `Test Plan:` line. {F1884584338} Differential Revision: D63027728 Pulled By: Myrthan Co-authored-by: Jokeren <robinho364@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136514 Approved by: https://github.com/jansel Co-authored-by: Jokeren <robinho364@gmail.com>	2024-09-25 09:34:14 +00:00
Jianyu Huang	0a35986cdb	Add option to configure reduced precision math backend for SDPA (#135964 ) Summary: Address https://github.com/pytorch/pytorch/issues/135778 by adding a global flag to configure whether using high precision or low precision for math backend of SDPA. Test Plan: buck2 run mode/opt //scripts/feikou/llm:run_attn_kernels Differential Revision: D62625515 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135964 Approved by: https://github.com/jbschlosser	2024-09-24 07:11:38 +00:00
Sergii Dymchenko	d9aca9914b	Remove duplicated words in library.rst (#136340 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136340 Approved by: https://github.com/svekars	2024-09-20 03:30:54 +00:00
Banit Agrawal	a575ce0dc6	[PyTorch Pinned Allocator] Add support of background thread to process events (#135524 ) Summary: Currently we process events in the regular allocation path and we call cudaEventQuery to check on the events and this path can take some locks in libcuda driver. Its not entirely needed to do process events in the allocation path, we could move this to a background thread and keep processing events regularly and put the freed block to the free list. Differential Revision: D62396585 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135524 Approved by: https://github.com/zyan0	2024-09-17 21:08:10 +00:00
Banit Agrawal	48d18fbd4c	[PyTorch CUDA Allocator] Allow reuse of non-split blocks with better rounding (#136174 ) Summary: This diff adds an option to round the non-split blocks in caching allocator so that they can be reused without causing lots of fragmentation for large memory segments. For example, if we specify max_split memory size as 400MB, then all allocations more than 400MB will not be split. Lets say, we allocated some 1024MB blocks and these are cached in the allocator blocks. If we request a new 500MB block, we round it to nearest power-2-division, thats 512MB, we add default kLargeBuffer of 20MB, that will be 532MB and since 532MB is less than existing 1024MB block, the 1024MB will not be used for this allocation, instead a new 512MB block will be created. In this diff, we provide an option to cofigure the kLargeBuffer for rounding and expose as a configurable option, so 512MB + max_non_split_rounding_size and if thats greater than 1024MB, we will use te 1024MB and we wont create a new 512MB block using cudaMalloc. This option is added so that we can pre-allocate some large blocks so that we can reuse them as much as possible and we dont stall on calling cudaMalloc. Differential Revision: D62758758 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174 Approved by: https://github.com/zyan0	2024-09-17 19:08:44 +00:00
Trung Truong	cc365fdd7b	[MTIA] Support torch.cuda.get_device_capability equivalent API on MTIA (#135889 ) Summary: Mirror `get_device_capability` on MTIA per https://fburl.com/gdoc/p4lo5avn At the moment, both the major and minor version are just 0 Test Plan: Unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api` https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688850109958190/ Differential Revision: D62595296 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135889 Approved by: https://github.com/egienvalue	2024-09-17 17:42:56 +00:00
Nikita Shulga	38caf10411	[EZ] Fix spelling typo (#136157 ) s/toosl/tools/ (spotted by @louie-tsai) Also, capitalize CUDA Pull Request resolved: https://github.com/pytorch/pytorch/pull/136157 Approved by: https://github.com/kit1980	2024-09-16 19:30:30 +00:00
PyTorch MergeBot	0199fd4d7e	Revert "[inductor] More fixes on the keys of `constants` and `signature` dictionaries (#135406 )" This reverts commit `e54b559e88`. Reverted https://github.com/pytorch/pytorch/pull/135406 on behalf of https://github.com/jeanschmidt due to Reverting as it is breaking triton_mtia internal signals @jansel could you have a look and help get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/135406#issuecomment-2353557481))	2024-09-16 17:58:02 +00:00
Howard Huang	e501ed71d4	Update link in distributed.tensor.parallel.rst (#136103 ) dtensor folder was moved Pull Request resolved: https://github.com/pytorch/pytorch/pull/136103 Approved by: https://github.com/kwen2501, https://github.com/fegin	2024-09-15 19:36:29 +00:00
Tugsbayasgalan Manlaibaatar	dec3403b24	Add some doc for export_for_training (#135918 ) Differential Revision: [D62610491](https://our.internmc.facebook.com/intern/diff/D62610491) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135918 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #135080, #135912	2024-09-15 17:08:12 +00:00
Tugsbayasgalan Manlaibaatar	1904b09e61	Create export_for_inference API and expose core_aten as public facing API (#135912 ) Differential Revision: [D62606908](https://our.internmc.facebook.com/intern/diff/D62606908) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135912 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #135080	2024-09-15 17:05:07 +00:00
Justin Chu	e2d3af405f	[ONNX] Remove logging apis from public (#133825 ) Remove - torch.onnx.enable_log - torch.onnx.disable_log - torch.onnx.set_log_stream - torch.onnx.log Because they are not meant for public consumption and has been marked for deprecation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133825 Approved by: https://github.com/titaiwangms	2024-09-13 22:19:52 +00:00
CaoE	2f53d570fe	Update document for autocast on CPU (#135299 ) Update document for autocast on CPU due to the support of float16 and changes in the operator list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135299 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/svekars	2024-09-13 09:11:47 +00:00
Jokeren	e54b559e88	[inductor] More fixes on the keys of `constants` and `signature` dictionaries (#135406 ) Previous PR forgets to change two other places that also create `constants` and `signature`. https://github.com/pytorch/pytorch/pull/135170 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135406 Approved by: https://github.com/jansel	2024-09-13 04:10:41 +00:00
Xavier Dupré	5e145861f2	[ONNX] Improves documentation of ONNX exporter (#135372 ) The PR updates the documentation to reflect the changes introduced in pytorch 2.5 and related to onnx exporter. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135372 Approved by: https://github.com/justinchuby Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>	2024-09-09 15:09:01 +00:00
Wanchao Liang	cfc227ad43	[reland][dtensor] move DTensor to public namespace (#134203 ) reland of https://github.com/pytorch/pytorch/pull/133113 I have to create a new PR because the previous reverted PR could not either be rebased, or imported successfully :( ---- Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes: * many path renames and path import fixes * a dedicated doc page without too much content yet (adding in the next PRs) * To preserve the BC for users still using the torch.distributed._tensor, I added a shim script to redirect old path calls to the new module The BC preserving is evidented by the fact that all DTensor tests are still working without changing the public imports. So it's safe to land the changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203 Approved by: https://github.com/tianyu-l	2024-09-08 17:08:40 +00:00
Yu, Guangye	b53d97c7be	[Intel GPU] Add XPU memory-related APIs (#129919 ) # Motivation According to https://github.com/pytorch/pytorch/issues/116322, we will help unify the device allocator. So we introduce a simple xpu device allocator only with the key functionality first. And expect to add some memory statistics-related functionality after the unification. But now, some memory statistic-related APIs listed in https://github.com/pytorch/pytorch/issues/127929 are requested. We need more time to unify the device allocator. In order to facilitate the user experience, we expect to support these memory statistic-related APIs before the unification. # Additional Context Fixes: #127929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129919 Approved by: https://github.com/dvrogozh, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #130923	2024-09-07 11:15:17 +00:00
Justin Chu	a6b9d444fb	[ONNX] Refactor exporter errors (#135180 ) Refactor exporter errors to combine old errors and new errors for API consistency. This PR also 1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited. 2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to use the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors. 3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`. 4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact. 5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct. Fixes https://github.com/pytorch/pytorch/issues/135125 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180 Approved by: https://github.com/titaiwangms	2024-09-07 00:50:15 +00:00
PyTorch MergeBot	a681260caf	Revert "[ONNX] Refactor exporter errors (#135180 )" This reverts commit `5eebd9315a`. Reverted https://github.com/pytorch/pytorch/pull/135180 on behalf of https://github.com/clee2000 due to I think this broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10743909338/job/29800779403) [HUD commit link](`5eebd9315a`), possibly a landrace with the PR that landed before it ([comment](https://github.com/pytorch/pytorch/pull/135180#issuecomment-2334844191))	2024-09-06 21:39:18 +00:00
Justin Chu	5eebd9315a	[ONNX] Refactor exporter errors (#135180 ) Refactor exporter errors to combine old errors and new errors for API consistency. This PR also 1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited. 2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to use the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors. 3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`. 4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact. 5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct. Fixes https://github.com/pytorch/pytorch/issues/135125 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180 Approved by: https://github.com/titaiwangms	2024-09-06 19:10:56 +00:00
Nowtryz	a15aabc975	Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262 ) Hi, I noticed the `unfold` operator was missing on MaskedTensor. I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it? Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262 Approved by: https://github.com/cpuhrsch	2024-09-06 19:06:23 +00:00
titaiwangms	28ccfba248	[ONNX] Delete ONNXProgramSerializer (#135261 ) Fixes #135182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135261 Approved by: https://github.com/justinchuby	2024-09-05 23:52:51 +00:00
Mikayla Gawarecki	a096f2899d	Add torch.serialization.skip_data context manager (#134504 ) ## Semantic The semantic is (1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint). ```python import torch import torch.nn as nn sd = nn.Linear(3, 5).state_dict() with torch.serialization.skip_data(): torch.save(sd, 'foo.pt') print(torch.load('foo.pt', weights_only=True)) ``` (2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`If FakeTensor is passed to `torch.save` the pickler will treat these FakeTensors as being "materialized" space will be reserved in the checkpoint for the associated storage bytes, and when loading the type will be Tensor instead of FakeTensor) ```python import torch import torch.nn as nn from torch._subclasses.fake_tensor import FakeTensorMode with FakeTensorMode(): m = nn.Linear(3, 5, dtype=torch.float16, device='cuda') sd = m.state_dict() with torch.serialization.skip_data(materialize_fake_tensors=True): torch.save(sd, 'bla.pt') print(torch.load('bla.pt', weights_only=True)) # OrderedDict([('weight', tensor([[0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))]) ``` ## Follow Ups - [ ] `torch.load` semantic for skip_data context manager - [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass) Differential Revision: [D62238610](https://our.internmc.facebook.com/intern/diff/D62238610) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504 Approved by: https://github.com/albanD	2024-09-05 16:53:39 +00:00
Animesh Jain	32f45f01a9	[dynamo] Retire CompileProfiler (#135133 ) Fixes confusion in https://github.com/pytorch/pytorch/issues/113443 We have TORCH_LOGS that supersedes CompileProfiler Pull Request resolved: https://github.com/pytorch/pytorch/pull/135133 Approved by: https://github.com/ezyang ghstack dependencies: #135039, #135121, #135129, #135130	2024-09-05 01:08:40 +00:00
Svetlana Karslioglu	0d193a0adf	Add ExecuTorch warning to mobile_optimizer (#134697 ) Preview: https://docs-preview.pytorch.org/pytorch/pytorch/134697/mobile_optimizer.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/134697 Approved by: https://github.com/ali-khosh, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-09-04 17:47:14 +00:00
PyTorch MergeBot	2fd36086bc	Revert "Add torch.serialization.skip_data context manager (#134504 )" This reverts commit `94db935749`. Reverted https://github.com/pytorch/pytorch/pull/134504 on behalf of https://github.com/kit1980 due to See D62082697 ([comment](https://github.com/pytorch/pytorch/pull/134504#issuecomment-2327542276))	2024-09-03 22:21:27 +00:00
Mikayla Gawarecki	94db935749	Add torch.serialization.skip_data context manager (#134504 ) ## Semantic The semantic is (1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint). ```python import torch import torch.nn as nn sd = nn.Linear(3, 5).state_dict() with torch.serialization.skip_data(): torch.save(sd, 'foo.pt') print(torch.load('foo.pt', weights_only=True)) ``` (2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`If FakeTensor is passed to `torch.save` the pickler will treat these FakeTensors as being "materialized" space will be reserved in the checkpoint for the associated storage bytes, and when loading the type will be Tensor instead of FakeTensor) ```python import torch import torch.nn as nn from torch._subclasses.fake_tensor import FakeTensorMode with FakeTensorMode(): m = nn.Linear(3, 5, dtype=torch.float16, device='cuda') sd = m.state_dict() with torch.serialization.skip_data(materialize_fake_tensors=True): torch.save(sd, 'bla.pt') print(torch.load('bla.pt', weights_only=True)) # OrderedDict([('weight', tensor([[0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))]) ``` ## Follow Ups - [ ] `torch.load` semantic for skip_data context manager - [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504 Approved by: https://github.com/albanD	2024-08-29 04:52:52 +00:00
Syed Tousif Ahmed	4655eb3ee2	Uses MemPoolContext to route allocations from CUDACachingAllocator (#134685 ) Re-open of https://github.com/pytorch/pytorch/pull/133599 that was mistakenly closed by issuing `ghstack land` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134685 Approved by: https://github.com/ezyang	2024-08-29 03:56:31 +00:00
PyTorch MergeBot	503c0dd923	Revert "Add MaskedTensor support to *_like API (#128637 )" This reverts commit `b6e51711a0`. Reverted https://github.com/pytorch/pytorch/pull/128637 on behalf of https://github.com/ZainRizvi due to Actually, seems like it was this commit that introduced the failure: test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604690725/job/29392898277) [HUD commit link](`b6e51711a0`) ([comment](https://github.com/pytorch/pytorch/pull/128637#issuecomment-2316554188))	2024-08-29 01:42:52 +00:00
PyTorch MergeBot	1285443994	Revert "Add torch.serialization.skip_data context manager (#134504 )" This reverts commit `202600bc23`. Reverted https://github.com/pytorch/pytorch/pull/134504 on behalf of https://github.com/mikaylagawarecki due to This is breaking Windows docs tests due to NamedTemporaryFile on Windows not working well ([comment](https://github.com/pytorch/pytorch/pull/134504#issuecomment-2316543901))	2024-08-29 01:30:49 +00:00
Avik Chaudhuri	ca03a14cf7	hang dim hint constants off Dim (#134702 ) Summary: Retry landing https://github.com/pytorch/pytorch/pull/134484 Test Plan: (see original) Differential Revision: D61925860 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134702 Approved by: https://github.com/pianpwk	2024-08-29 01:02:01 +00:00
Mikayla Gawarecki	202600bc23	Add torch.serialization.skip_data context manager (#134504 ) ## Semantic The semantic is (1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint). ```python import torch import torch.nn as nn sd = nn.Linear(3, 5).state_dict() with torch.serialization.skip_data(): torch.save(sd, 'foo.pt') print(torch.load('foo.pt', weights_only=True)) ``` (2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`If FakeTensor is passed to `torch.save` the pickler will treat these FakeTensors as being "materialized" space will be reserved in the checkpoint for the associated storage bytes, and when loading the type will be Tensor instead of FakeTensor) ```python import torch import torch.nn as nn from torch._subclasses.fake_tensor import FakeTensorMode with FakeTensorMode(): m = nn.Linear(3, 5, dtype=torch.float16, device='cuda') sd = m.state_dict() with torch.serialization.skip_data(materialize_fake_tensors=True): torch.save(sd, 'bla.pt') print(torch.load('bla.pt', weights_only=True)) # OrderedDict([('weight', tensor([[0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.], # [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))]) ``` ## Follow Ups - [ ] `torch.load` semantic for skip_data context manager - [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504 Approved by: https://github.com/albanD	2024-08-28 23:53:17 +00:00
PyTorch MergeBot	f997b2b8e6	Revert "Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262 )" This reverts commit `f685018ea9`. Reverted https://github.com/pytorch/pytorch/pull/125262 on behalf of https://github.com/ZainRizvi due to Hi, this PR appears to be calling maskedtensor tests to fail on main. Please rebase your changes onto the latest trunk build to repro the failure. test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604716811/job/29393256312) [HUD commit link](`f685018ea9`) ([comment](https://github.com/pytorch/pytorch/pull/125262#issuecomment-2316387447))	2024-08-28 23:10:07 +00:00
Nowtryz	f685018ea9	Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262 ) Hi, I noticed the `unfold` operator was missing on MaskedTensor. I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it? Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262 Approved by: https://github.com/cpuhrsch	2024-08-28 21:30:39 +00:00
Nowtryz	b6e51711a0	Add MaskedTensor support to *_like API (#128637 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128637 Approved by: https://github.com/cpuhrsch	2024-08-28 21:28:23 +00:00
PyTorch MergeBot	13d40f6fc5	Revert "hang dim hint constants off Dim (#134484 )" This reverts commit `c142af7209`. Reverted https://github.com/pytorch/pytorch/pull/134484 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134484#issuecomment-2315749549))	2024-08-28 16:05:42 +00:00
Avik Chaudhuri	c142af7209	hang dim hint constants off Dim (#134484 ) Summary: Recently https://github.com/pytorch/pytorch/pull/133620 added support for automatic dynamic shapes, where a new enum, `DIM`, was introduced to provide hints like `AUTO` and `STATIC`. This PR is a nominal change where we expose the hints via the existing public `Dim` API, and remove `DIM` from the public API. The main motivation is to avoid having users need to import too many things. Test Plan: existing Differential Revision: D61807361 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134484 Approved by: https://github.com/angelayi	2024-08-28 14:35:40 +00:00
Jerry Zhang	3ef4c27ab3	Update pt2e numeric debugger to use node.meta["custom"] field (#134040 ) Summary: With https://github.com/pytorch/pytorch/pull/131912 we now have a "custom" field in node.meta that can be preserved in * copy/deepcopy * run_decompositions() * serialization * re-exporting So we refactored numeric debugger to use this. Test Plan: python test/test_quantization.py TestNumericDebugger Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/134040 Approved by: https://github.com/tarun292	2024-08-27 19:51:03 +00:00
Tianyi Tao	7af38eb98b	Fix unexpected inference_mode interaction with torch.autograd.functional.jacobian (#130307 ) Fixes #128264 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130307 Approved by: https://github.com/soulitzer	2024-08-25 22:14:02 +00:00
Yiming Zhou	2cfc2da527	[export] Make move_to_device_pass function public (#134263 ) Summary: This is a follow-up of https://github.com/pytorch/pytorch/pull/133660 Here we make the `move_to_device_pass()` function publich so users can call it by `from torch.export.passes import move_to_device_pass` Test Plan: CI Differential Revision: D61671310 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134263 Approved by: https://github.com/angelayi	2024-08-23 23:18:30 +00:00
Pian Pawakapan	8ff3a5be1b	[export] basic auto dynamic shapes (#133620 ) Starter version of automatic dynamic shapes for export. Creates enums `DIM.AUTO`, `DIM.STATIC`, allowing user to specify `AUTO` for dims in dynamic_shapes specs, meaning that corresponding dims are treated as dynamic, and relevant guards will do what's necessary (e.g. refine ValueRanges, set replacements based on equality, or even set static) without raising ConstraintViolationErrors. Basically allows the user to say, "a bunch of these dims can be dynamic, let export do model analysis and return the program with maximum possible dynamism, without complaining". The usage for specifying `dynamic_shapes` is now: ``` AUTO -> dynamic by default, return whatever produce_guards() says, even if it's static None/int/STATIC -> static Dim/DerivedDim -> same as before - will complain if the min/max range is invalid, or if dims related to this are unspecified. ``` Caveat 1: specifying `AUTO` for a dim won't guarantee it'll be dynamic: - specifying `AUTO` for a dim will return the maximum possible dynamism given your program and other specified constraints, but this can still mean you'll get a static program. For example, with the program below, x is specified dynamic, but it's equal to y, which is specified static, and with how we currently do things we won't promote y to dynamic, but will demote(?) x to static. So this can be surprising if you don't fully know your model, and/or missed one of your other inputs when specifying auto-dynamic shapes. ``` class Foo(torch.nn.Module): def forward(self, x, y): return x + y inputs = (torch.randn(6), torch.randn(6)) export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": None}) ``` Caveat 2: specifying `AUTO` and Dims in the same spec is still problematic: - The way Dims/DerivedDims are currently handled is very strict. A Dim represents a symbol, and we require a user to specify the symbol for all dims governed by the symbol - that's why we've seen errors in the past like `The values of x must always be related to y by ...`, asking the user to specify the exact relation as in the program. We also require the specified min/max range to be a subset of the valid range from model analysis. All this doesn't compose well with specifying `AUTO` just yet - for example in the program below, ideal behavior could be to return a dynamic program, where `dx = x.size(0) = y.size(0)` has range (3,6). Unfortunately this crashes, and correct behavior is to specify `dx` for both inputs. So currently we raise a UserError and crash if both Dims + `AUTO` are present in the spec. ``` class Foo(torch.nn.Module): def forward(self, x, y): return x + y inputs = (torch.randn(6), torch.randn(6)) export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": {0: Dim("dx", min=3, max=6)}}) # this doesn't work, because x & y and related ``` Implementation details: This is done by setting `assume_static_by_default=False`, and doing a transform on the `dynamic_shapes` spec to preserve semantics. `assume_static_by_default=False` will treat unspecified dims or Nones as dynamic. This is the opposite of what `export.export()` currently does - unspecified Dims/Nones are treated as static. Historically this static-by-default behavior, where the user deals with fewer guards, has been desirable, and we would like to respect that in this implementation. So this internal spec transformation is added, `_transform_shapes_for_default_dynamic()`, does the spec conversion necessary to be compatbile with dynamic by default. Specifically, AUTOs are converted into Nones, and Nones/unspecified dims are filled in with explicitly static constraints. For example, this would look like, for a 3-d tensor: `{0: DIM.AUTO, 1: None, 2: Dim("dx")} -> {0: None, 1: 32, 2: Dim("dx")}` This does seem overly complicated, but it's done to preserve dynamic shapes semantics for `torch._dynamo.export()`, which already uses `assume_static_by_default=False`, and follows the same process for generating shape constraints , via `_process_dynamic_shapes`. There the semantics are: ``` None/unspecified: dynamic by default Dim/DerivedDim: also a strict assertion ``` If we don't care about BC for `_dynamo.export(dynamic_shapes)`, then we can just modify semantics for `_process_dynamic_shapes()` and change all the relevant tests in `test/dynamo/test_export.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133620 Approved by: https://github.com/avikchaudhuri	2024-08-23 22:56:39 +00:00
Avik Chaudhuri	b454c51060	remove dynamic_dim (#134211 ) Summary: As promised in https://github.com/pytorch/pytorch/pull/134045. Test Plan: existing Differential Revision: D61646937 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134211 Approved by: https://github.com/angelayi	2024-08-23 04:13:03 +00:00
Howard Huang	108a75b454	[PP] Add ZeroBubble schedule (#133467 ) Zero bubble can be expressed through `ScheduleFlexibleInterleaved1F1B` by setting `enable_zero_bubble=True`. But instead of having to include this flag in schedule initialization we should create a separate ZeroBubbleSchedule and also transition `Interleaved1F1B` to derive from `ScheduleFlexibleInterleaved1F1B`. Then we dont need to expose `ScheduleFlexibleInterleaved1F1B` since the naming is not obvious Pull Request resolved: https://github.com/pytorch/pytorch/pull/133467 Approved by: https://github.com/wconstab ghstack dependencies: #132691	2024-08-22 13:32:15 +00:00
Zitong Zhan	90c821814e	SparseCsrCUDA: cuDSS backend for linalg.solve (#129856 ) This PR switches to cuDSS library and has the same purpose of #127692, which is to add Sparse CSR tensor support to linalg.solve. Fixes #69538 Minimum example of usage: ``` import torch if __name__ == '__main__': spd = torch.rand(4, 3) A = spd.T @ spd b = torch.rand(3).to(torch.float64).cuda() A = A.to_sparse_csr().to(torch.float64).cuda() x = torch.linalg.solve(A, b) print((A @ x - b).norm()) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856 Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn Co-authored-by: Zihang Fang <zhfang1108@gmail.com> Co-authored-by: Huy Do <huydhn@gmail.com>	2024-08-22 07:57:30 +00:00
Jesse Cai	255cd75a97	[sparse] Add cuSPARSELt as a backend (#128534 ) Summary: This PR adds in cuSPARSELt as a backend to PyTorch. It is now possible to see if cuSPARSELt is available and the version if it is with ``` torch.backends.cusparselt.is_available() torch.backends.cusparselt.version() ``` Test Plan: ``` python test/test_sparse_semi_structured.py -k test_cusparselt_backend ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/128534 Approved by: https://github.com/cpuhrsch, https://github.com/eqy, https://github.com/syed-ahmed	2024-08-21 22:06:07 +00:00
Xuehai Pan	022cd7c9aa	[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712 ) Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`. `5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)` Example: ```python >>> import operator >>> operator.indexOf([1, 2, 3, 4, 5], 3) 2 >>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3) Unsupported: ... >>> @torch.compiler.substitute_in_graph(operator.indexOf) ... def indexOf(sequence, x): ... for i, item in enumerate(sequence): ... if item is x or item == x: ... return i ... raise ValueError("sequence.index(x): x not in sequence") >>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3) 2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712 Approved by: https://github.com/jansel	2024-08-21 06:36:41 +00:00
Justin Chu	e8fc1e0118	[ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530 ) 1/n PR to - Move code from torch-onnx from commit `395495e566` into torch.onnx and fixes imports. - Integrate the new export logic with the torch.onnx.export API and include basic set of tests. - Refactor the API for the change. - Improve documentation. Next PRs will be more tests and docs. Fix https://github.com/pytorch/pytorch/issues/129277 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132530 Approved by: https://github.com/titaiwangms, https://github.com/malfet	2024-08-21 01:08:42 +00:00
Sahdev Zala	06cc2e83f0	Make optim.swa.util content accessible from the torch.optim doc (#133393 ) Link various classes and functions of the `optim.swa.util` to make doc content accessible from the `torch.optim` doc. Currently, if you click the link, https://pytorch.org/docs/stable/optim.html#module-torch.optim.swa_utils it goes to a blank, bottom of the page section of `torch.optim`. Also, `torch.optim.swa_utils.AveragedModel` and `torch.optim.swa_utils.SWALR` classes as well as `torch.optim.swa_utils.update_bn()` and `optim.swa_utils.get_ema_multi_avg_fn` are not linked to doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/133393 Approved by: https://github.com/janeyx99	2024-08-21 00:43:46 +00:00
PyTorch MergeBot	15b5a0b67f	Revert "[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712 )" This reverts commit `71dd52f51a`. Reverted https://github.com/pytorch/pytorch/pull/133712 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))	2024-08-20 21:14:45 +00:00
Xuehai Pan	71dd52f51a	[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712 ) Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`. `5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)` Example: ```python >>> import operator >>> operator.indexOf([1, 2, 3, 4, 5], 3) 2 >>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3) Unsupported: ... >>> @torch.compiler.substitute_in_graph(operator.indexOf) ... def indexOf(sequence, x): ... for i, item in enumerate(sequence): ... if item is x or item == x: ... return i ... raise ValueError("sequence.index(x): x not in sequence") >>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3) 2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712 Approved by: https://github.com/jansel	2024-08-20 19:48:57 +00:00
PyTorch MergeBot	2bd02e0c82	Revert "[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712 )" This reverts commit `641724ed1d`. Reverted https://github.com/pytorch/pytorch/pull/133712 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))	2024-08-20 10:34:41 +00:00
PyTorch MergeBot	68570fca69	Revert "Add MaskedTensor support to *_like API (#128637 )" This reverts commit `8de56e2958`. Reverted https://github.com/pytorch/pytorch/pull/128637 on behalf of https://github.com/jeanschmidt due to Introduced API linting errors ([comment](https://github.com/pytorch/pytorch/pull/128637#issuecomment-2298270307))	2024-08-20 08:26:28 +00:00
Xuehai Pan	641724ed1d	[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712 ) Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`. `5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)` Example: ```python >>> import operator >>> operator.indexOf([1, 2, 3, 4, 5], 3) 2 >>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3) Unsupported: ... >>> @torch.compiler.substitute_in_graph(operator.indexOf) ... def indexOf(sequence, x): ... for i, item in enumerate(sequence): ... if item is x or item == x: ... return i ... raise ValueError("sequence.index(x): x not in sequence") >>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3) 2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712 Approved by: https://github.com/jansel	2024-08-19 22:14:33 +00:00
nowtryz	8de56e2958	Add MaskedTensor support to *_like API (#128637 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128637 Approved by: https://github.com/cpuhrsch	2024-08-19 22:13:59 +00:00
PyTorch MergeBot	35f36363ec	Revert "[dtensor] move DTensor to public namespace (#133113 )" This reverts commit `2ee6b97464`. Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))	2024-08-19 05:00:19 +00:00
Wanchao Liang	2ee6b97464	[dtensor] move DTensor to public namespace (#133113 ) Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes: * many path renames and path import fixes * a dedicated doc page without too much content yet (adding in the next PRs) * To preserve the BC for users still using the `torch.distributed._tensor`, I added a shim script to redirect old path calls to the new module The BC preserving is evidented by the fact that all DTensor tests are still working without changing the public imports. So it's safe to land the changes Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113 Approved by: https://github.com/XilunWu ghstack dependencies: #133305, #133306	2024-08-17 05:09:52 +00:00
Mikayla Gawarecki	018e48c337	[Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489 ) Reland #130633 USE_CUFILE turned off by default in this version Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489 Approved by: https://github.com/albanD	2024-08-15 17:11:52 +00:00
Sahdev Zala	19270cff61	Add a reference for the LRScheduler class (#133243 ) The `LRScheduler` class provides methods to adjusts the learning rate during optimization (as updated in this PR). Also, as a note, all the classes of lr_scheduluer are already provided in the `How to adjust learning rate` section. Fixes #127884 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133243 Approved by: https://github.com/janeyx99	2024-08-13 16:20:22 +00:00
fduwjj	dc8bb2636c	[c10d][doc] Add docs for ENV variables TORCH_NCCL_ASYNC_ERROR_HANDLING TORCH_NCCL_TRACE_CPP_STACK and TORCH_NCCL_COORD_CHECK_MILSEC (#132920 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132920 Approved by: https://github.com/fegin, https://github.com/wconstab	2024-08-09 21:08:20 +00:00
Edward Z. Yang	1f66487c69	[BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132770 Approved by: https://github.com/bdhirsh	2024-08-08 23:07:23 +00:00
PyTorch MergeBot	d1f73fd844	Revert "[BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770 )" This reverts commit `902c6f3a19`. Reverted https://github.com/pytorch/pytorch/pull/132770 on behalf of https://github.com/ezyang due to Removed API was recommitted ([comment](https://github.com/pytorch/pytorch/pull/132770#issuecomment-2275749689))	2024-08-08 12:54:34 +00:00
Edward Z. Yang	902c6f3a19	[BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132770 Approved by: https://github.com/bdhirsh ghstack dependencies: #132674, #132675, #132421, #132062, #132767, #132769	2024-08-08 12:03:25 +00:00
Edward Z. Yang	aec6332356	Only thunkify proxies in some situations (#132421 ) The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead. I annotated the PR with explanation of changes. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421 Approved by: https://github.com/Skylion007, https://github.com/zou3519 ghstack dependencies: #132674, #132675	2024-08-08 12:03:06 +00:00
Edward Z. Yang	361db32d47	Consolidate SymDispatchMode into ProxyTensorMode (#132674 ) Instead of having a separate context variable for SymDispatchMode, we now simply delegate to the current active proxy tensor mode when we need to trace a SymInt. We maintain a separate `__sym_dispatch__` magic method as the calling convention is different than `__torch_dispatch__`. Consolidating the modes in this ways means that we can consistently disable both of these modes in tandem simply by removing the mode from the proxy mode infra slot. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674 Approved by: https://github.com/zou3519, https://github.com/bdhirsh	2024-08-08 12:02:54 +00:00
daitian1995	aff48f7378	Autoselect default device in FSDP construction. (#127609 ) There are still some differences between CUDA and non-CUDA custom devices when construct FSDP because CUDA is selected as the default device. For example, when construct FSDP from CPU model and device_id is not passed, device_handle will choose CUDA as default device. This PR will autoselect the real device as the default device. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127609 Approved by: https://github.com/awgu	2024-08-08 05:25:17 +00:00
PyTorch MergeBot	a9ff190867	Revert "Consolidate SymDispatchMode into ProxyTensorMode (#132674 )" This reverts commit `ffdf48e63b`. Reverted https://github.com/pytorch/pytorch/pull/132674 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132674#issuecomment-2274062785))	2024-08-07 18:25:33 +00:00
PyTorch MergeBot	780310fed7	Revert "Only thunkify proxies in some situations (#132421 )" This reverts commit `bb99008c9e`. Reverted https://github.com/pytorch/pytorch/pull/132421 on behalf of https://github.com/clee2000 due to I think this broke dynamo/test_subclasses.py::TestNestedTensor::test_in_graph_construction_from_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10283744685/job/28459340678) [HUD commit link](`bb99008c9e`). Test got added in `f50621989b` which is before your merge base ([comment](https://github.com/pytorch/pytorch/pull/132421#issuecomment-2273742960))	2024-08-07 15:29:54 +00:00
Edward Z. Yang	bb99008c9e	Only thunkify proxies in some situations (#132421 ) The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead. I annotated the PR with explanation of changes. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421 Approved by: https://github.com/Skylion007, https://github.com/zou3519 ghstack dependencies: #132674, #132675	2024-08-07 11:51:17 +00:00
Edward Z. Yang	ffdf48e63b	Consolidate SymDispatchMode into ProxyTensorMode (#132674 ) Instead of having a separate context variable for SymDispatchMode, we now simply delegate to the current active proxy tensor mode when we need to trace a SymInt. We maintain a separate `__sym_dispatch__` magic method as the calling convention is different than `__torch_dispatch__`. Consolidating the modes in this ways means that we can consistently disable both of these modes in tandem simply by removing the mode from the proxy mode infra slot. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674 Approved by: https://github.com/zou3519, https://github.com/bdhirsh	2024-08-06 17:03:17 +00:00
Wouter Devriendt	e8645fa2b9	[Doc] fix some typos (found by codespell and typos) (#132544 ) Applying doc fixes from PR https://github.com/pytorch/pytorch/pull/127267 - with CLA Pull Request resolved: https://github.com/pytorch/pytorch/pull/132544 Approved by: https://github.com/kit1980	2024-08-05 17:21:56 +00:00
Xuehai Pan	4226ed1585	[BE] Format uncategorized Python files with `ruff format` (#132576 ) Remove patterns ``, `test/`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576 Approved by: https://github.com/ezyang, https://github.com/Skylion007 ghstack dependencies: #132574	2024-08-04 17:13:31 +00:00
Syed Tousif Ahmed	7c89ec0f7c	Implements torch.cuda.MemPool() API (#131152 ) In this PR: - Pool id creation logic is refactored and moved to a MemPool class. `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change. - MemPool holds a pointer to a CUDAAllocator as proposed in https://github.com/pytorch/pytorch/issues/124807#issuecomment-2077506997. Tests are added to show usage with CUDAPluggableAllocator. - MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in CUDACachingAllocator to route allocations to a user provided allocator. See draft here: https://github.com/pytorch/pytorch/pull/125722/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/131152 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-08-01 01:29:30 +00:00
Luca Wehrstedt	f4f7aba75d	Expose function to probe whether PyTorch was built with FlashAttention (#131894 ) This is needed by downstream projects (e.g., xFormers) to determine whether they can count on FlashAttention in PyTorch or whether they need to build it themselves. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131894 Approved by: https://github.com/drisspg, https://github.com/eqy	2024-07-31 11:33:09 +00:00
ekamiti	9e473fd868	Make adding Buffers more like adding Parameters (#125971 ) Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible. Fixes #35735 Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971 Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos	2024-07-31 10:32:40 +00:00
Simon Mahns	dcb03106b7	[Land Internally] MTIA equivalent of torch.cuda.memory_stats (#132007 ) Summary: as title Test Plan: pytorch ci failing: https://github.com/pytorch/pytorch/issues/131962 Differential Revision: D60335413 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132007 Approved by: https://github.com/hanzlfs, https://github.com/egienvalue	2024-07-29 20:47:18 +00:00
PyTorch MergeBot	eb9409511e	Revert "support zb1p and zb2p algorithms (#130752 )" This reverts commit `8fe5b93667`. Reverted https://github.com/pytorch/pytorch/pull/130752 on behalf of https://github.com/atalman due to Broke Periodic CI: distributed/pipelining/test_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131472868/job/28014900187) [HUD commit link](`8fe5b93667`) ([comment](https://github.com/pytorch/pytorch/pull/130752#issuecomment-2255819078))	2024-07-29 12:40:00 +00:00
PyTorch MergeBot	e191b83462	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit `709ddf7a9d`. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))	2024-07-26 18:08:20 +00:00
PyTorch MergeBot	b343644f3a	Revert "MTIA equivalent of torch.cuda.memory_stats (#131673 )" This reverts commit `513ce5f69a`. Reverted https://github.com/pytorch/pytorch/pull/131673 on behalf of https://github.com/clee2000 due to linked internal diff has internal changes, not sure what happened here, but this shouldn't have been merged externally without also merging the internal diff ([comment](https://github.com/pytorch/pytorch/pull/131673#issuecomment-2251749644))	2024-07-26 00:54:37 +00:00
Mikayla Gawarecki	709ddf7a9d	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-25 22:23:38 +00:00
Simon Mahns	513ce5f69a	MTIA equivalent of torch.cuda.memory_stats (#131673 ) Summary: Adding MTIA equivalent of `torch.cuda.memory_stats` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131673 Approved by: https://github.com/egienvalue	2024-07-25 21:59:59 +00:00
Yanbo Liang	a34692c0a3	[Inductor] Added and_masks and or_masks utilities & make fully masked out rows 0 instead of nan (#131552 ) Combine #131073 and #131012 and fix doc building failures. Co-authored-by: chilli <chilli@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/131552 Approved by: https://github.com/Chillee	2024-07-25 21:29:46 +00:00
Mikayla Gawarecki	c3d099ddd1	[BE][Easy] Add hooks to doc for Optimizer base class (#131628 ) Happened to notice this was missing from the base class (but is rendering for the other optimizers like Adam etc.) when I wanted to link the state_dict hooks for https://discuss.pytorch.org/t/global-not-per-param-optimizer-state/206769 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131628 Approved by: https://github.com/janeyx99	2024-07-25 15:07:08 +00:00
Jane Xu	9c4cf866c2	Adafactor forloop basic impl (#129905 ) #109581 At this point, the vanilla implementation (the default) is good. Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor Specifically, the impl in this PR, which attempts to replicate the paper, ``` optim = torch.optim.Adafactor([weight]) ``` is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor ``` optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False) ``` is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor ``` optim = keras.optimizers.Adafactor(learning_rate=0.01) ``` The three results respectively for the same randomly generated weights: ``` # ours tensor([[ 0.3807594, -0.3912092], [ 0.0762539, 0.5377805], [ 0.2459473, 0.4662207]]) # pytorch-optimizer tensor([[ 0.3807592, -0.3912172], [ 0.0762507, 0.5377818], [ 0.2459457, 0.4662213]]) # keras array([[ 0.38076326, -0.39121315], [ 0.0762547 , 0.5377859 ], [ 0.24594972, 0.46622536]], dtype=float32) ``` This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences: * keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step)` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01. * We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling. <details> Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing My script repro: ``` import torch from pytorch_optimizer import AdaFactor torch.set_printoptions(precision=7) weight = torch.tensor([[ 0.37697506, -0.39500135], [ 0.07246649, 0.53399765], [ 0.24216151, 0.46243715]], dtype=torch.float32) # bias = torch.tensor([0, 0], dtype=torch.float32) weight.grad = torch.tensor([[-0.5940447, -0.7743838], [-0.5940447, -0.7743838], [-0.5940447, -0.7743838]], dtype=torch.float32) # bias.grad = torch.tensor([-2.5027974, 1.5422692], dtype=torch.float32) weight_c = weight.clone() weight_c.grad = weight.grad.clone() optim = torch.optim.Adafactor([weight]) optim.step() print(weight) optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False) optim_c.step() print(weight_c) ``` <details> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905 Approved by: https://github.com/albanD	2024-07-25 13:17:19 +00:00
Haoci Zhang	8fe5b93667	support zb1p and zb2p algorithms (#130752 ) Previously, we have proved that ZB2P is not truly zero bubble when num_local_stages exceed 4 and so only ZB1P was supported. We did a few tweaks to the ZB2P to really make it zero bubble. Algorithm and proof is attached. [zero_bubble.pdf](https://github.com/user-attachments/files/16238738/zero_bubble.pdf) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130752 Approved by: https://github.com/H-Huang	2024-07-24 17:58:46 +00:00
Jun Luo	abb313b466	[torch.mtia] Noop set_rng_state and get_rng_state APIs (#130873 ) Summary: As title Test Plan: CI tests Reviewed By: joebos Differential Revision: D59036602 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130873 Approved by: https://github.com/hanzlfs	2024-07-24 01:52:21 +00:00
Shangdi Yu	68c725a094	[custom ops] Add register_vmap for custom ops (#130589 ) Fixes #130284 Fixes #130653 - Add `torch.library.register_vmap` to custom ops - Add `register_vmap` for operators in ops in custom_op_db. - Make `torch.autograd.Function` support kwarg-only kwargs for vmap - test operators in op_db with `tests/test_vmap`. - change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589 Approved by: https://github.com/zou3519	2024-07-23 17:48:38 +00:00
PyTorch MergeBot	e4b5645f83	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit `5b5e0698a5`. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))	2024-07-23 17:19:34 +00:00
PyTorch MergeBot	b435d84261	Revert "[custom ops] Add register_vmap for custom ops (#130589 )" This reverts commit `074b420641`. Reverted https://github.com/pytorch/pytorch/pull/130589 on behalf of https://github.com/atalman due to Please fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/130589#issuecomment-2244092174))	2024-07-23 01:44:44 +00:00
Shangdi Yu	074b420641	[custom ops] Add register_vmap for custom ops (#130589 ) Fixes #130284 Fixes #130653 - Add `torch.library.register_vmap` to custom ops - Add `register_vmap` for operators in ops in custom_op_db. - Make `torch.autograd.Function` support kwarg-only kwargs for vmap - test operators in op_db with `tests/test_vmap`. - change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589 Approved by: https://github.com/zou3519	2024-07-23 00:54:52 +00:00
Mikayla Gawarecki	5b5e0698a5	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-22 14:51:24 +00:00
PyTorch MergeBot	26383a6cc0	Revert "Added and_masks and or_masks utilities (#131073 )" This reverts commit `92bb323d36`. Reverted https://github.com/pytorch/pytorch/pull/131073 on behalf of https://github.com/albanD due to The docs build fails here and in trunk ([comment](https://github.com/pytorch/pytorch/pull/131073#issuecomment-2242997958))	2024-07-22 13:44:55 +00:00
chilli	92bb323d36	Added and_masks and or_masks utilities (#131073 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131073 Approved by: https://github.com/drisspg ghstack dependencies: #130871, #130904	2024-07-22 11:48:03 +00:00
Soumith Chintala	8e478d4fb1	Add Alban and Piotr into Core Maintainers (#130903 ) See official announcement here: https://dev-discuss.pytorch.org/t/alban-desmaison-and-piotr-bialecki-are-now-pytorch-core-maintainers/2280 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130903 Approved by: https://github.com/albanD, https://github.com/Skylion007	2024-07-20 16:02:42 +00:00
Li-Huai (Allan) Lin	125be005eb	[Docs] Fix fake tensor doc (#131205 ) Fix this: `# AttributeError: 'FakeTensorMode' object has no attribute 'from_real_tensor'` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131205 Approved by: https://github.com/eellison	2024-07-19 17:59:45 +00:00
Jerry Zhang	793b17ebcb	Add numeric_debugger top level APIs (#130643 ) Summary: Add three top level APIs for numeric debugger in pt2e flow that can log intermediate output in the model and calculate summary for metric comparisons between nodes in two graphs * `prepare_for_propagation_comparison` * `extract_results_from_loggers` * `compare_results` Test Plan: python test/test_quantization.py -k test_prepare_for_propagation_comparison python test/test_quantization.py -k test_extract_results_from_loggers Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130643 Approved by: https://github.com/dulinriley, https://github.com/tarun292	2024-07-18 20:54:18 +00:00
redwrasse	63a0a65df9	Define 'zero-preserving unary functions' in docs (#130804 ) Make explicit the definition of 'zero-preserving unary functions' in the sparse tensors documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130804 Approved by: https://github.com/soulitzer	2024-07-18 13:30:29 +00:00
drisspg	2b43d339fe	Make FlexAttention API public (#130755 ) # Summary Makes the prototype API flex_attention public Pull Request resolved: https://github.com/pytorch/pytorch/pull/130755 Approved by: https://github.com/Chillee	2024-07-16 16:21:25 +00:00
Xuehai Pan	a3abfa5cb5	[BE][Easy][1/19] enforce style for empty lines in import segments (#129752 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129752 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-07-16 00:42:56 +00:00
Jerry Zhang	b893aa71ca	Rename generate_numeric_debug_handle to numeric_debugger (#130590 ) Summary: att Test Plan: CI Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130590 Approved by: https://github.com/dulinriley, https://github.com/tarun292	2024-07-15 22:42:27 +00:00
Yu, Guangye	7cd48df2da	Refine the logic of device construction when only device index is given (#129119 ) # Motivation Before this PR, device construction was `cuda` type when only a device index was given. It also returns the `PrivateUser1` type if a `PrivateUser1` type is registered. ```bash >>> import torch >>> device = torch.device(0) >>> device.type 'cuda' >>> a = torch.tensor([1, 2]) >>> b = a.to(0) >>> b tensor([1, 2], device='cuda:0') ``` It works well on CUDA GPU. But it will raise unexpected information and error running on XPU. ```bash >>> import torch >>> device = torch.device(0) >>> device.type 'cuda' >>> a = torch.tensor([1, 2]) >>> b = a.to(0) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/xxx/pytorch/torch/cuda/__init__.py", line 302, in _lazy_init raise AssertionError("Torch not compiled with CUDA enabled") AssertionError: Torch not compiled with CUDA enabled ``` With this PR, refine the logic to use the currently available device type instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129119 Approved by: https://github.com/albanD, https://github.com/gujinghui, https://github.com/EikanWang ghstack dependencies: #129463, #129205, #129363	2024-07-15 14:34:29 +00:00
Yu, Guangye	9cae2160f5	Introduce the concept of Accelerators to PyTorch doc (#129363 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129363 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #129463, #129205	2024-07-15 14:24:46 +00:00
Mikayla Gawarecki	7c289c2a5c	Add torch.serialization.safe_globals context manager (#127939 ) Add context manager mentioned in https://github.com/pytorch/pytorch/pull/127808#pullrequestreview-2096298486 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127939 Approved by: https://github.com/albanD	2024-07-12 20:38:43 +00:00
rzou	9c69684af8	[custom_ops] expose torch.library.register_torch_dispatch (#130261 ) This is the API for defining the interaction between a torch_dispatch class and a custom op. Taking API bikeshedding. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261 Approved by: https://github.com/albanD ghstack dependencies: #130064	2024-07-12 14:13:01 +00:00
Shangdi Yu	fb9bc6d74a	[custom op] add doc for CustomOpDef.set_kernel_enabled (#130406 ) <img width="1067" alt="Screenshot 2024-07-09 at 6 14 55 PM" src="https://github.com/pytorch/pytorch/assets/22356083/941751f8-8e12-43cb-8477-c739476e0096"> <img width="965" alt="Screenshot 2024-07-09 at 6 14 59 PM" src="https://github.com/pytorch/pytorch/assets/22356083/aa9be099-f26c-45a3-8a14-742a2bb7c28b"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130406 Approved by: https://github.com/zou3519	2024-07-11 15:47:35 +00:00
Shangdi Yu	a4576dad34	[reland][custom ops] infer schema (#130079 ) Fixes #129617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079 Approved by: https://github.com/zou3519	2024-07-11 03:39:07 +00:00
PyTorch MergeBot	86bca69c5f	Revert "[custom_ops] expose torch.library.register_torch_dispatch (#130261 )" This reverts commit `bb9a73f767`. Reverted https://github.com/pytorch/pytorch/pull/130261 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130261#issuecomment-2221569707))	2024-07-10 21:43:28 +00:00
PyTorch MergeBot	e14a0f45ed	Revert "[reland][custom ops] infer schema (#130079 )" This reverts commit `bef085bdfa`. Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2221561483))	2024-07-10 21:40:16 +00:00
Shangdi Yu	bef085bdfa	[reland][custom ops] infer schema (#130079 ) Fixes #129617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079 Approved by: https://github.com/zou3519	2024-07-10 16:18:36 +00:00
rzou	bb9a73f767	[custom_ops] expose torch.library.register_torch_dispatch (#130261 ) This is the API for defining the interaction between a torch_dispatch class and a custom op. Taking API bikeshedding. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261 Approved by: https://github.com/albanD ghstack dependencies: #130064	2024-07-09 21:11:27 +00:00
Yuanhao Ji	312652c325	[RFC] Add support for device extension autoloading (#127074 ) Fixes #122468 - Load device extensions at the end of `torch/__init__.py` - Enabled by default, or you can disable it with `TORCH_DEVICE_BACKEND_AUTOLOAD=0` run test: ```python python test/run_test.py -i test_autoload_enable python test/run_test.py -i test_autoload_disable ``` doc: https://docs-preview.pytorch.org/pytorch/pytorch/127074/miscellaneous_environment_variables.html co-author: @jgong5 @bsochack @bkowalskiINTEL @jczaja @FFFrog @hipudding Co-authored-by: albanD <desmaison.alban@gmail.com> Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127074 Approved by: https://github.com/albanD, https://github.com/jgong5	2024-07-09 06:14:13 +00:00
PyTorch MergeBot	44a773c121	Revert "[custom ops] infer schema (#130079 )" This reverts commit `3fe324ffb6`. Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/huydhn due to The test_public_bindings failure looks legit `3fe324ffb6` ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2215420957))	2024-07-08 22:02:29 +00:00
Shangdi Yu	3fe324ffb6	[custom ops] infer schema (#130079 ) Fixes #129617 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079 Approved by: https://github.com/zou3519	2024-07-08 20:46:23 +00:00
Kurt Mohler	e590168865	Enable sharing meta tensors between processes (#129520 ) Fixes #129436 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129520 Approved by: https://github.com/ezyang	2024-07-04 20:29:48 +00:00
Li-Huai (Allan) Lin	42f3d7e948	[MPS] Add mps profiler env vars to docs (#129552 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129552 Approved by: https://github.com/malfet ghstack dependencies: #129451	2024-07-04 06:44:48 +00:00
Zhengxu Chen	042d764872	[export] Update example inputs format for DB. (#129982 ) Summary: To give user a simpler example code, we are getting rid of ExportArgs in favor of example_args and example_kwargs. Test Plan: CI Differential Revision: D59288920 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129982 Approved by: https://github.com/angelayi	2024-07-03 17:53:15 +00:00
Edward Z. Yang	29c68df600	Stop immediately specializing common constants 0/1 for plain int (#128327 ) Fixes https://github.com/pytorch/pytorch/issues/128319 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128327 Approved by: https://github.com/lezcano ghstack dependencies: #129983	2024-07-03 16:41:51 +00:00
Howard Huang	4eb449f7dc	[pipelining] add small logging section to docs (#129368 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129368 Approved by: https://github.com/wconstab	2024-07-02 18:19:28 +00:00
Haoci Zhang	1ad683033b	Implemented flexible PP schedule (#129597 ) Enabled some cases to work where num_microbatches % pp_size != 0. Using the flex_pp schedule, we will have num_rounds = max(1, n_microbatches // pp_group_size) and it works as long as n_microbatches % num_rounds is 0. As a few examples, support pp_group_size = 4, n_microbatches = 10. We will have num_rounds = 2 and n_microbatches % 2 is 0. pp_group_size = 4, n_microbatches = 3. We will have num_rounds = 1 and n_microbatches % 1 is 0. Moved over from PiPPy (https://github.com/pytorch/PiPPy/pull/1129) Tested using the config in (1), schedule looks like the following graph: ``` =========== ALL_RANK_ACTIONS =========== Rank 0 Rank 1 Rank 2 Rank 3 Step 00: F0_s0 None None None Step 01: F1_s0 F0_s1 None None Step 02: F2_s0 F1_s1 F0_s2 None Step 03: F3_s0 F2_s1 F1_s2 F0_s3 Step 04: F4_s0 F3_s1 F2_s2 F1_s3 Step 05: F0_s4 F4_s1 F3_s2 F2_s3 Step 06: F1_s4 F0_s5 F4_s2 F3_s3 Step 07: F2_s4 F1_s5 F0_s6 F4_s3 Step 08: F3_s4 F2_s5 F1_s6 F0_s7 Step 09: F4_s4 F3_s5 None B0_s7 Step 10: F5_s0 None F2_s6 F1_s7 Step 11: None None B0_s6 B1_s7 Step 12: None F4_s5 F3_s6 F2_s7 Step 13: None B0_s5 B1_s6 B2_s7 Step 14: F6_s0 F5_s1 F4_s6 F3_s7 Step 15: B0_s4 B1_s5 B2_s6 B3_s7 Step 16: F7_s0 F6_s1 F5_s2 F4_s7 Step 17: B1_s4 B2_s5 B3_s6 B4_s7 Step 18: F8_s0 F7_s1 F6_s2 F5_s3 Step 19: B2_s4 B3_s5 B4_s6 B0_s3 Step 20: F9_s0 F8_s1 F7_s2 F6_s3 Step 21: B3_s4 B4_s5 B0_s2 B1_s3 Step 22: F5_s4 F9_s1 F8_s2 F7_s3 Step 23: B4_s4 B0_s1 B1_s2 B2_s3 Step 24: F6_s4 F5_s5 F9_s2 F8_s3 Step 25: B0_s0 B1_s1 B2_s2 B3_s3 Step 26: F7_s4 F6_s5 F5_s6 F9_s3 Step 27: B1_s0 B2_s1 B3_s2 B4_s3 Step 28: F8_s4 F7_s5 F6_s6 F5_s7 Step 29: B2_s0 B3_s1 B4_s2 B5_s7 Step 30: F9_s4 F8_s5 F7_s6 F6_s7 Step 31: B3_s0 B4_s1 B5_s6 B6_s7 Step 32: None F9_s5 F8_s6 F7_s7 Step 33: B4_s0 B5_s5 B6_s6 B7_s7 Step 34: None None F9_s6 F8_s7 Step 35: B5_s4 B6_s5 B7_s6 B8_s7 Step 36: None None None F9_s7 Step 37: B6_s4 B7_s5 B8_s6 B9_s7 Step 38: None None None None Step 39: B7_s4 B8_s5 B9_s6 B5_s3 Step 40: None None None None Step 41: B8_s4 B9_s5 B5_s2 B6_s3 Step 42: None None None None Step 43: B9_s4 B5_s1 B6_s2 B7_s3 Step 44: None None None None Step 45: B5_s0 B6_s1 B7_s2 B8_s3 Step 46: None None None None Step 47: B6_s0 B7_s1 B8_s2 B9_s3 Step 48: None None None Step 49: B7_s0 B8_s1 B9_s2 Step 50: None None Step 51: B8_s0 B9_s1 Step 52: None ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129597 Approved by: https://github.com/H-Huang	2024-07-02 07:54:38 +00:00
eqy	f845a7a91a	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-30 19:22:16 +00:00
PyTorch MergeBot	3d96217891	Revert "[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 )" This reverts commit `9e1f3ecaa7`. Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is still failing with the same error ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2197801405))	2024-06-29 00:47:15 +00:00
PyTorch MergeBot	999eec8dea	Revert "[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 )" This reverts commit `b7e7a4cb01`. Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some test_transformer running on internal A100 and V100 ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196202003))	2024-06-28 06:03:54 +00:00
Xuehai Pan	9e1f3ecaa7	[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 ) Changes by apply order: 1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`. 2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`. 3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first. `.parent{...}.absolute()` -> `.absolute().parent{...}` 4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.) `.parent.parent.parent.parent` -> `.parents[3]` 5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~ ~`.parents[3]` -> `.parents[4 - 1]`~ 6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-06-28 00:35:15 +00:00
Li-Huai (Allan) Lin	84ad5452f6	[MPS] Fused SGD optimizer (#129350 ) ``` [-------------------------------------- Fused SGD --------------------------------------] \| Fused: True \| Fused: False 1 threads: ------------------------------------------------------------------------------ numel: 1024, num_tensors: 100, momentum: True \| 2 \| 15 numel: 1024, num_tensors: 100, momentum: False \| 2 \| 5 numel: 65536, num_tensors: 100, momentum: True \| 3 \| 16 numel: 65536, num_tensors: 100, momentum: False \| 2 \| 5 numel: 1048576, num_tensors: 100, momentum: True \| 11 \| 16 numel: 1048576, num_tensors: 100, momentum: False \| 8 \| 6 numel: 1024, num_tensors: 500, momentum: True \| 29 \| 70 numel: 1024, num_tensors: 500, momentum: False \| 20 \| 24 numel: 65536, num_tensors: 500, momentum: True \| 33 \| 76 numel: 65536, num_tensors: 500, momentum: False \| 22 \| 26 numel: 1048576, num_tensors: 500, momentum: True \| 70 \| 80 numel: 1048576, num_tensors: 500, momentum: False \| 43 \| 40 numel: 1024, num_tensors: 1000, momentum: True \| 108 \| 139 numel: 1024, num_tensors: 1000, momentum: False \| 72 \| 48 numel: 65536, num_tensors: 1000, momentum: True \| 116 \| 150 numel: 65536, num_tensors: 1000, momentum: False \| 77 \| 52 numel: 1048576, num_tensors: 1000, momentum: True \| 190 \| 170 numel: 1048576, num_tensors: 1000, momentum: False \| 120 \| 50 ``` ```python def profile_fused_sgd(): from torch.optim.sgd import sgd import torch.utils.benchmark as benchmark import itertools def profile(fn, params, grads, momentum_buffer_list, fused): fn( params, grads, momentum_buffer_list, momentum=True if len(momentum_buffer_list) > 0 else False, dampening=0.0, nesterov=False, foreach=False, fused=fused, lr=1e-3, weight_decay=.0, maximize=False, grad_scale=None, found_inf=None, ) torch.mps.synchronize() device = "mps" results = [] for num_tensors, numel, momentum in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False]): sublabel = f"numel: {numel}, num_tensors: {num_tensors}, momentum: {momentum}" print(sublabel) params, grads = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(2)] momentum_buffer_list = [torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] if momentum else [] fn = sgd for fused in [True, False]: t = benchmark.Timer( stmt='profile(fn, params, grads, momentum_buffer_list, fused)', label='Fused SGD', sub_label=sublabel, globals=locals(), description= f"Fused: {fused}", ).blocked_autorange(min_run_time=5) results.append(t) compare = benchmark.Compare(results) compare.trim_significant_figures() compare.colorize(rowwise=True) compare.print() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129350 Approved by: https://github.com/janeyx99 ghstack dependencies: #129006, #129008, #129007, #129105	2024-06-27 04:37:14 +00:00
PyTorch MergeBot	895316119d	Revert "[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 )" This reverts commit `0314c4c101`. Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes lots of internal build failures where they fail to find hipify module ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2192437052))	2024-06-26 19:03:57 +00:00
Shangdi Yu	cca85c96cd	[export] minor typo fix (#129543 ) Fixes a typo in torch.export doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129543 Approved by: https://github.com/angelayi	2024-06-26 18:35:31 +00:00
Eddie Yan	b7e7a4cb01	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-26 00:49:18 +00:00
Zhengxu Chen	e58ef5b65f	[export] Rewrite exportdb formatting. (#129260 ) Summary: It'll be easier to generate examples if the code doesn't depend on exportdb library. Test Plan: CI Differential Revision: D58886554 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129260 Approved by: https://github.com/tugsbayasgalan	2024-06-25 21:04:53 +00:00
Li-Huai (Allan) Lin	71ebe5121a	[MPS] Fast math env var (#129007 ) Allow users to decide whether they want to have fast math enabled via env var Pull Request resolved: https://github.com/pytorch/pytorch/pull/129007 Approved by: https://github.com/malfet ghstack dependencies: #129006, #129008	2024-06-25 13:52:07 +00:00
Xuehai Pan	0314c4c101	[BE][Easy] use `pathlib.Path` instead of `dirname` / `".."` / `pardir` (#129374 ) Changes by apply order: 1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`. 2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`. 3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first. `.parent{...}.absolute()` -> `.absolute().parent{...}` 4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.) `.parent.parent.parent.parent` -> `.parents[3]` 5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~ ~`.parents[3]` -> `.parents[4 - 1]`~ 6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374 Approved by: https://github.com/justinchuby, https://github.com/malfet	2024-06-25 08:28:38 +00:00
Will Constable	2f8b301c32	Clean up distributed/CONTRIBUTING.md (#128450 ) Click [here](`cf6c88af48/torch/distributed/CONTRIBUTING.md`) to see the rendered version of the file in this PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/128450 Approved by: https://github.com/wanchaol	2024-06-22 02:41:22 +00:00
rzou	311fadb1fb	[docs] Redirect custom ops landing page to the correct place (#129177 ) I'm moving it to pytorch/tutorials Pull Request resolved: https://github.com/pytorch/pytorch/pull/129177 Approved by: https://github.com/albanD	2024-06-21 13:31:32 +00:00
cyy	5c676bb8b3	Remove Caffe2 handling from onnx_unpack_quantized_weights (#129021 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129021 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-06-21 06:16:44 +00:00
Jing Xu	5fba5d83f0	add xpu for amp (#127276 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to AMP doc. Co-authored-by: Yu, Guangye <guangye.yu@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127276 Approved by: https://github.com/dvrogozh, https://github.com/albanD, https://github.com/malfet	2024-06-20 21:49:35 +00:00
Zhengxu Chen	65286883d4	[export] reland "experimental joint graph API." (#129081 ) Summary: previous diff got reverted despite CI was green. Test Plan: CI Differential Revision: D58790048 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129081 Approved by: https://github.com/tugsbayasgalan	2024-06-20 16:50:53 +00:00
Oguz Ulgen	54b0006cb2	Evaluate symexprs on load path of cache not write (#128997 ) When caching is enabled, an internal model fails with ``` assert_size_stride(bmm_9, (17, s0, 512), (54784, 512, 1)) AssertionError: expected size 17==17, stride 57344==54784 at dim=0 ``` looking at this model, the exact problem is when the cache is hit on the forward graph, the generated code for backward fails since the strides of the outputs of forward, passed to backward as inputs, are not what we expected. This PR changes the evaluation logic so that we defer evaluation of output stride exprs to load path as opposed to eagerly doing it on save path. I have not been able to come up with a unit test repro for this problem. Differential Revision: [D58796503](https://our.internmc.facebook.com/intern/diff/D58796503) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128997 Approved by: https://github.com/ezyang	2024-06-20 08:55:12 +00:00
Li-Huai (Allan) Lin	19f3abcde4	[Docs][MPS] Add mps environment variable table (#129008 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129008 Approved by: https://github.com/malfet ghstack dependencies: #129006	2024-06-20 03:30:35 +00:00
PyTorch MergeBot	df94d57c0a	Revert "[export] experimental joint graph API. (#128847 )" This reverts commit `0707811286`. Reverted https://github.com/pytorch/pytorch/pull/128847 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/128847#issuecomment-2179326891))	2024-06-19 19:04:36 +00:00
Zhengxu Chen	0707811286	[export] experimental joint graph API. (#128847 ) Summary: WARNING: This API is highly unstable and will be subject to change in the future. Add a protoype to "decompose" an ExportedProgram into a joint graph form, so that we can compute the gradients on this graph. Test Plan: buck test mode/opt caffe2/torch/fb/export:test_experimental Differential Revision: D55657917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128847 Approved by: https://github.com/tugsbayasgalan	2024-06-19 16:45:27 +00:00
Li-Huai (Allan) Lin	0fc603ece4	[optim] Fused implementation stability table (#129006 ) I'd like to discuss the criteria that we regard an implementation as stable. If there is no existing standard, my initial proposal would be a 6 month period after the commit to regard it as stable. As a result, now Adam and AdamW on CUDA would be considered as stable, while the rest are of beta. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129006 Approved by: https://github.com/malfet	2024-06-19 16:29:49 +00:00
soulitzer	1877b7896c	[checkpoint] Clean up selective activation checkpoint and make public (#125795 ) ### bc-breaking for existing users of the private API: - Existing policy functions must now change their return value to be [CheckpointPolicy](`c0b40ab42e/torch/utils/checkpoint.py (L1204-L1230)`) Enum instead of bool. - To restore previous behavior, return `PREFER_RECOMPUTE` instead of `False` and `{PREFER,MUST}_SAVE` instead of `True` depending whether you prefer the compiler to override your policy. - Policy function now accepts a `ctx` object instead of `mode` for its first argument. - To restore previous behavior, `mode = "recompute" if ctx.is_recompute else "forward"`. - Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `create_selective_checkpoint_contexts `. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint). Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit Memory considerations: - As with the existing SAC, cached values are cleared upon first use. - We error if the user wishes to backward a second time on a region forwarded with SAC enabled. In-place: - We use version counting to enforce that if any cached tensor has been mutated. In-place operations not mutating cached tensors are allowed. - `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user is cleverly also saves the output of the in-place) Randomness, views - Currently in this PR, we don't do anything special for randomness or views, the author of the policy function is expected to handle them properly. (Would it would be beneficial to error? - we either want to save all or recompute all random tensors) Tensor object preservation - ~We guarantee that if a tensor does not requires grad, and it is saved, then what you get out is the same tensor object.~ UPDATE: We guarantee that if a tensor is of non-differentiable dtype AND it is not a view, and it is saved, then what you get out is the same tensor object. This is a nice guarantee for nested tensors which care about the object identity of of the offsets tensor. Policy function - Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error. - The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3). - The number of times we call the policy_fn is something that should be documented as part of public API. We call the policy function for all ops except ~~detach~~ UPDATE : metadata ops listed in `torch.utils.checkpoint.SAC_IGNORED_OPS`) because these ops may be called a different number of times by AC itself between forward and recompute. - The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below). Tensors guaranteed to be the same tensor as-is - Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795 Approved by: https://github.com/Chillee, https://github.com/fmassa	2024-06-18 18:18:50 +00:00
Boyuan Feng	43998711a7	[CUDAGraph] add more docs for cudagraph trees (#127963 ) This PR adds more documentation for CUDAGraph Trees, including - Iteration Support - Input Mutation Support - Dynamic Shape Support - NCCL Support - Reasons for Skipping CUDAGraph Pull Request resolved: https://github.com/pytorch/pytorch/pull/127963 Approved by: https://github.com/eellison	2024-06-18 02:07:07 +00:00
ibartol	c6b180a316	Created docs (and example) for cudart function in torch.cuda (#128741 ) Fixes #127908 ## Description Created docs to document the torch.cuda.cudart function to solve the issue #127908. I tried to stick to the [guidelines to document a function](https://github.com/pytorch/pytorch/wiki/Docstring-Guidelines#documenting-a-function) but I was not sure if there is a consensus on how to handle the docs of a function that calls an internal function. So I went ahead and tried what the function will raise, etc. from the user endpoint and documented it (i.e. I am giving what actually _lazy_init() will raise). Updated PR from #128298 since I made quite a big mistake in my branch. I apologize for the newbie mistake. ### Summary of Changes - Added docs for torch.cuda.cudart - Added the cudart function in the autosummary of docs/source/cuda.rst ## Checklist - [X] The issue that is being fixed is referred in the description - [X] Only one issue is addressed in this pull request - [X] Labels from the issue that this PR is fixing are added to this pull request - [X] No unnecesary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/128741 Approved by: https://github.com/msaroufim	2024-06-17 16:50:37 +00:00
BowenBao	ab13980424	[ONNX] Update 'person_of_interest.rst', 'CODEOWNERS' and 'merge_rules.yaml' (#126364 ) The following are all constrained under the ONNX exporter project scope. - `personal_of_interest.rst` - Moving folks no longer working on the project to emeritus. - Adding @justinchuby, @titaiwangms, @shubhambhokare1 and @xadupre, who have all made countless contributions to this project. - `CODEOWNERS` - Removing folks no longer working on the project. - Updating new owners who will now be notified with PRs related to the specific file paths. - `merge_rules.yaml` - Removing folks no longer working on the project. 🫡 Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126364 Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/albanD	2024-06-16 04:52:16 +00:00
Zheng, Zhaoqiong	a2d9c430b4	Adding a note for Getting Started with PyTorch on Intel GPUs (#127872 ) Adding a note for Getting Started with PyTorch on Intel GPUs Pull Request resolved: https://github.com/pytorch/pytorch/pull/127872 Approved by: https://github.com/svekars	2024-06-14 14:24:28 +00:00
PyTorch MergeBot	6895a5804c	Revert "[checkpoint] Clean up selective activation checkpoint and make public (#125795 )" This reverts commit `c472cec565`. Reverted https://github.com/pytorch/pytorch/pull/125795 on behalf of https://github.com/soulitzer due to breaking torchtitan CI ([comment](https://github.com/pytorch/pytorch/pull/125795#issuecomment-2167036157))	2024-06-14 01:14:59 +00:00
Jing Xu	8763d44bf1	add xpu to torch.compile (#127279 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to torch.compile doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127279 Approved by: https://github.com/dvrogozh, https://github.com/svekars	2024-06-13 21:15:09 +00:00
Jing Xu	7fe9ab9ccc	update amp example to device-agnostic (#127278 ) As support for Intel GPU has been upstreamed, this PR is to make the AMP example doc device-agnostic. Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/127278 Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/svekars	2024-06-13 02:01:16 +00:00
soulitzer	c472cec565	[checkpoint] Clean up selective activation checkpoint and make public (#125795 ) Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit Memory considerations: - As with the existing SAC, cached values are cleared upon first use. - We error if the user wishes to backward a second time on a region forwarded with SAC enabled. In-place: - We use version counting to enforce that if any cached tensor has been mutated. In-place operations not mutating cached tensors are allowed. - `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user is cleverly also saves the output of the in-place) Randomness, views - Currently in this PR, we don't do anything special for randomness or views, the author of the policy function is expected to handle them properly. (Would it would be beneficial to error? - we either want to save all or recompute all random tensors) Tensor object preservation - We guarantee that if a tensor does not requires grad, and it is saved, then what you get out is the same tensor object. If the tensor does require grad, we must detach to avoid creating a reference cycle. This is a nice guarantee for nested tensors which care about the object identity of of the offsets tensor. Policy function - Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error. - The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3). - The number of times we call the policy_fn is something documented part of public API. We call the policy function for all ops except detach because detach is itself called a different number of times by AC between forward and recompute. - The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below). Tensors guaranteed to be the same tensor as-is - Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary. "bc-breaking" for existing users of the private API: - Existing policy functions must now change their return value to use the Enum. - Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `gen_selective_checkpoint_context_fn`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint). Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795 Approved by: https://github.com/Chillee, https://github.com/fmassa	2024-06-12 23:57:33 +00:00
PyTorch MergeBot	817ce6835b	Revert "[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 )" This reverts commit `4c971932e8`. Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2163690162))	2024-06-12 18:47:52 +00:00
Kulin Seth	8df56afc20	Add support in Python API for the recommended max working set size. (#128289 ) Adds ways for users to request recommended max size for Metal on Mac. It plumbs through https://developer.apple.com/documentation/metal/mtldevice/2369280-recommendedmaxworkingsetsize?language=objc Can be used like ``` max_memory = torch.mps.recommended_max_memory() print ("Recommended Max Memory : ", (max_memory/(102410241024)), "GB") ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128289 Approved by: https://github.com/malfet	2024-06-12 16:03:57 +00:00
Jing Xu	205410cb44	add xpu to torch.tensors (#127280 ) As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to torch.tensors doc. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127280 Approved by: https://github.com/svekars	2024-06-11 18:13:01 +00:00
Ke Wen	fe39c07826	[pipelining][doc] Remove duplicated words (#128368 ) "for execution" is used in both step titles Pull Request resolved: https://github.com/pytorch/pytorch/pull/128368 Approved by: https://github.com/wconstab ghstack dependencies: #128361	2024-06-11 04:52:57 +00:00
Ke Wen	4077cdd589	[pipelining][doc] Update arg list of pipeline API (#128361 ) And document the use of `build_stage` API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128361 Approved by: https://github.com/wconstab	2024-06-11 02:55:17 +00:00
Jun Luo	f843ccbb1a	[MTIA] Add set_device support (#128040 ) Summary: Support set_device API in MTIA backend. Reviewed By: gnahzg Differential Revision: D58089498 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128040 Approved by: https://github.com/gnahzg	2024-06-10 23:42:52 +00:00
loganthomas	583a56d5a8	DOC: add docstring to construct_and_record_rdzv_event() (#128189 ) Fixes #127902 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128189 Approved by: https://github.com/kurman	2024-06-10 22:17:33 +00:00
Shuqiang Zhang	c7e2c9c37e	[c10d][doc] add a doc page for NCCL ENVs (#128235 ) Addressing issue: https://github.com/pytorch/pytorch/issues/128204 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128235 Approved by: https://github.com/wconstab	2024-06-09 16:08:38 +00:00
eqy	4c971932e8	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-09 06:53:34 +00:00
Ke Wen	613c7d270d	[pipelining] Format doc (#128279 ) - Should use two dots around `var` - Wrap lines - Add section cross ref Pull Request resolved: https://github.com/pytorch/pytorch/pull/128279 Approved by: https://github.com/H-Huang ghstack dependencies: #128273, #128278	2024-06-08 04:59:04 +00:00
Ke Wen	2e42671619	[pipelining] Rename to stage.py and schedules.py (#128278 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128278 Approved by: https://github.com/H-Huang ghstack dependencies: #128273	2024-06-08 04:42:35 +00:00
Ke Wen	0e3fe694d1	[pipelining] Restore a stage constructor for tracer path (#128273 ) In case user modified stage module out of place, such as mod = DDP(mod) mod = torch.compile(mod) They need a stage builder else than `pipe.build_stage()`. This PR provides an API to do so: ``` def build_stage( stage_module, stage_index, pipe.info(), ... ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128273 Approved by: https://github.com/wconstab	2024-06-08 04:42:35 +00:00
Will Constable	f9508b4c1f	[pipelining] Update Pipelining Docs (#128236 ) ---- - Bring PipelineStage/Schedule more front-and-center - provide details on how to manually construct PipelineStage - move tracer example and manual example below so the high-level flow (e2e) is closer to the top Pull Request resolved: https://github.com/pytorch/pytorch/pull/128236 Approved by: https://github.com/H-Huang ghstack dependencies: #128201, #128228	2024-06-08 02:03:46 +00:00
Ke Wen	ad96f991a5	[pipelining] Add pipe.build_stage() (#128240 ) Given `PipelineStage` name to manual side. Thus adding a method under `Pipe` to create PipelineStage. Moved `PipeInfo` to utils.py to avoid circular dependency between `_IR` and `PipelineStage`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128240 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2024-06-08 01:26:02 +00:00
Howard Huang	bef586111a	[pipelining] pipelining.rst updates (#128228 ) fix some nits and add `PipelineStage` (manual) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128228 Approved by: https://github.com/wconstab ghstack dependencies: #128201	2024-06-07 23:29:54 +00:00
Ke Wen	3090667cf9	[pipelining] pipeline() taking microbatch as example input (#128163 ) Changed the API of `pipeline()` to take microbatch instead of full batch as example args. Main purpose is to: - make this API more atomic; - decouple tracing frontend from runtime info like `num_chunks`. Side effects: - Creates opportunity for varying `num_chunks` of schedules with the same `pipe` object. - User has to create example microbatch input. - Chunk spec stuff are now all moved to runtime side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128163 Approved by: https://github.com/H-Huang	2024-06-07 15:51:53 +00:00
Howard Huang	543a870943	[pipelining] Rename ManualPipelineStage -> PipelineStage (#128157 ) Renaming ManualPipelineStage to remove the "Manual" part. I needed to replace the existing `PipelineStage` which takes in the `pipe` argument, so I have renamed that to `TracerPipelineStage`. @kwen2501 will remove this entirely in favor of adding a util to `Pipe` to just create the stage directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128157 Approved by: https://github.com/wconstab	2024-06-07 09:24:16 +00:00
chunyuan	7efaeb1494	[AOTI] docs: add suggestion to turn on freezing on CPU (#128010 ) With https://github.com/pytorch/pytorch/pull/124350 landed, it is now suggested in AOTI to turn on freezing on CPU to get better performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128010 Approved by: https://github.com/desertfire	2024-06-07 08:57:02 +00:00
Ke Wen	01601ebd41	Retire torch.distributed.pipeline (#127354 ) Actually retiring module after deprecation warning for a while. The new supported module is: torch.distributed.pipelining. Please migrate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354 Approved by: https://github.com/wconstab	2024-06-07 08:11:58 +00:00
Ke Wen	96806b1777	[pipelining][doc] Add frontend description and change tracer example (#128070 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128070 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2024-06-07 04:09:36 +00:00
Pian Pawakapan	50155e825b	[export] provide refine function for automatically accepting dynamic shapes suggested fixes (#127436 ) Summary: Part of the work helping export's automatic dynamic shapes / dynamic shapes refining based on suggested fixes. Introduces a util function refine_dynamic_shapes_from_suggested_fixes() that takes the error message from a ConstraintViolationError message containing suggested dynamic shapes fixes, along with the original dynamic shapes spec, and returns the new spec. Written so that the suggested fixes from export can be directly parsed and used. Example usage for the automatic dynamic shapes workflow: ``` # export, fail, parse & refine suggested fixes, re-export try: export(model, inps, dynamic_shapes=dynamic_shapes) except torch._dynamo.exc.UserError as exc: new_shapes = refine_dynamic_shapes_from_suggested_fixes(exc.msg, dynamic_shapes) export(model, inps, dynamic_shapes=new_shapes) ``` For examples of behavior, see the added test and docstring. Will take suggestions for renaming the function to something else 😅 Test Plan: test_export tests Differential Revision: D57409142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127436 Approved by: https://github.com/avikchaudhuri	2024-06-07 03:29:06 +00:00
brightonanc	6dfdce92ba	Fixed typos in the complex numbers portion of the autograd docs (#127948 ) This PR fixes several typos in the complex numbers section of the docs for autograd. Only documentation was altered. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127948 Approved by: https://github.com/soulitzer	2024-06-06 22:47:04 +00:00
ibartol	bb2de3b101	Fixed broken link and removed unfinished sentence from issue #126367 (#127938 ) Fixes #126367. ## Description Fixed a broken link in the pytorch/docs/source/torch.compiler_faq.rst doc and deleted a few words that were extra according to the issue tagged above. ## Checklist - [X] The issue that is being fixed is referred in the description - [X] Only one issue is addressed in this pull request - [X] Labels from the issue that this PR is fixing are added to this pull request - [X] No unnecesary issues are included into this pull request Pull Request resolved: https://github.com/pytorch/pytorch/pull/127938 Approved by: https://github.com/msaroufim	2024-06-05 07:37:32 +00:00
Svetlana Karslioglu	20f966a8e0	Ignore undocumented PipelineSchedule.step (#127955 ) Ignore undocumented PipelineSchedule.step to fix doc build: https://github.com/pytorch/pytorch/actions/runs/9372492435/job/25805861083?pr=127938#step:11:1284 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/127955 Approved by: https://github.com/kit1980	2024-06-04 22:11:09 +00:00
Tristan Rice	597922ba21	Reapply "distributed debug handlers (#126601 )" (#127805 ) This reverts commit `7646825c3e`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805 Approved by: https://github.com/PaliC	2024-06-04 19:44:30 +00:00
PyTorch MergeBot	0ff60236ab	Revert "Retire torch.distributed.pipeline (#127354 )" This reverts commit `b9c058c203`. Reverted https://github.com/pytorch/pytorch/pull/127354 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the doc build failure looks legit `b9c058c203` ([comment](https://github.com/pytorch/pytorch/pull/127354#issuecomment-2148133982))	2024-06-04 18:19:31 +00:00
Ke Wen	b9c058c203	Retire torch.distributed.pipeline (#127354 ) Actually retiring module after deprecation warning for a while. The new supported module is: torch.distributed.pipelining. Please migrate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354 Approved by: https://github.com/wconstab	2024-06-04 07:03:26 +00:00
Jeff Daily	0e7bd7fedd	[ROCm] TunableOp improvements (#124362 ) - use less memory; smaller default hipblaslt workspace size - options to avoid cache effects - icache flush option - rotating buffers during tuning - python APIs - unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124362 Approved by: https://github.com/xw285cornell	2024-06-03 22:30:11 +00:00
Sheng Fu	c1dd3a615f	Implement Graph Transform Observer (#127427 ) Summary: Implement Graph Transform Observer Differential Revision: D57887518 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127427 Approved by: https://github.com/angelayi	2024-06-02 06:49:47 +00:00
PyTorch MergeBot	7646825c3e	Revert "distributed debug handlers (#126601 )" This reverts commit `3d541835d5`. Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987))	2024-05-31 01:21:24 +00:00
Alex Baden	5d316c81be	[Inductor] Add 0 initialization to Triton masked loads (#127311 ) For a masked `tl.load` operation, the Triton language specifies that values masked out (i.e. where the mask evaluates to false) are undefined in the output of the load. Triton provides an optional `other` parameter which, when included, provides an explicit value to use for masked out values from the load. If the output from a masked load without the `other` parameter is used in a conditional, unexpected behavior can occur. Despite the language specification, all Triton backends currently in use by PyTorch Inductor (NVIDIA, AMD, and Intel) 0-initialize masked loads if `other` is not present (we recently changed the Intel backend behavior to match NVIDIA and AMD because that's what our users expect, even if we are not following the Triton spec to the tee). This PR attempts to "future-proof" Inductor for new backends (or perhaps changes in the current backends? - we did not see any performance change from 0-initializing in the Intel XPU backend but one could imagine compiler optimizations to remove paths that depend on undefined) to add an explicit `other` in instances where later conditionals depend on the `tl.load` output. I also removed an exception to `other` behavior for boolean loads, which was put in place for a Triton bug that should be fixed. I added `other` to the getting started documentation as a clue that masked load behavior requires explicit initialization if, even though I don't expect `undef` values to cause the example code to fail if the underlying output is not 0-initialized. Finally, I added other to the `make_load` function in `select_algorithm.py`, though I wasn't able to determine if that function was actually being called. Fixes #126535 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127311 Approved by: https://github.com/jansel	2024-05-30 04:50:54 +00:00
Tristan Rice	3d541835d5	distributed debug handlers (#126601 ) This adds debug handlers as described in: * https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy) * https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy) This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR. This adds 2 handlers out of the box: * `/handler/ping` for testing purposes * `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder Test plan: ``` python test/distributed/elastic/test_control_plane.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601 Approved by: https://github.com/kurman, https://github.com/c-p-i-o	2024-05-30 02:21:08 +00:00
rzou	1abcac9dab	New Custom Ops Documentation landing page (#127400 ) We create a new landing page for PyTorch custom ops (suggested by jansel). All of our error messages will link here, and I'll work with the docs team to see if we can boost SEO for this page. NB: the landing page links some non-searchable webpages. Two of those (the Python custom ops tutorial and C++ custom ops tutorial) will turn into actual webpages when PyTorch 2.4 comes around. I'll make the third one (the Custom Operators Manual) once it stabilizes (we continously add new things to it and the length means that we might want to create a custom website for it to make the presentation more ingestable). Test Plan: - view docs preview. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127400 Approved by: https://github.com/jansel ghstack dependencies: #127291, #127292	2024-05-30 01:06:04 +00:00
Edward Z. Yang	76fc58c160	Document the legacy constructor for Tensor (#122625 ) Fixes https://github.com/pytorch/pytorch/issues/122408 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122625 Approved by: https://github.com/albanD	2024-05-29 23:23:19 +00:00
Xuehai Pan	26f4f10ac8	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980	2024-05-27 14:49:57 +00:00
PyTorch MergeBot	55c0ab2887	Revert "[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 )" This reverts commit `7763c83af6`. Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))	2024-05-27 09:22:08 +00:00
Xuehai Pan	7763c83af6	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980 ghstack dependencies: #127122, #127123, #127124, #127125	2024-05-27 04:22:18 +00:00
Xuehai Pan	35ea5c6b22	[3/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torchgen (#127124 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo make `usort` do more and generate the changes in the PR. Except `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127124 Approved by: https://github.com/Skylion007 ghstack dependencies: #127122, #127123	2024-05-25 19:20:03 +00:00
Yu, Guangye	e7a42702f9	generalize custom_fwd&custom_bwd to be device-agnostic (#126531 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126531 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD, https://github.com/EikanWang ghstack dependencies: #126527	2024-05-25 06:48:16 +00:00
Yu, Guangye	c09205a057	Deprecate device-specific GradScaler autocast API (#126527 ) # Motivation ## for `torch.amp.GradScaler`, - `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch. amp.GradScaler("cpu", args...)`. - `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`. So, we intend to depreate them and strongly recommend developer to use `torch.amp.GradScaler`. ## for `custom_fwd` and `custom_bwd`, this is a good solution to make the custom function run with or without effect even in an autocast-enabled region and can be shared by other backends, like CPU and XPU. So we generalize it to be device-agnostic and put them int `torch/amp/autocast_mode.py` and re-expose to `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`. # Additional Context Add UT to cover the deprecated warning. No need for more UTs to cover the functionality of `torch.amp.custom_f/bwd`, the existing UTs that previously covered the functionality of `torch.cuda.amp.custom_f/bwd` can cover them. To facilitate the review, we separate these code changes to two PRs. The first PR cover `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang	2024-05-25 06:41:34 +00:00
lezcano	a30baec0c3	[Docs] Fix NumPy + backward example (#126872 ) We were calling backward on a tensor not a scalar... Pull Request resolved: https://github.com/pytorch/pytorch/pull/126872 Approved by: https://github.com/albanD	2024-05-22 21:29:31 +00:00
Kurman Karabukaev	d62b025efc	[TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743 ) Summary: 1. Define explicit `use_agent_store` on rdzv handlers. Handlers that set is true can share the store. 2. Instead of agent coordinating master_add/master_port values, the logic is now encapsulated by a rdzv_handler where `RendezvousInfo` will have `RendezvousStoreInfo` object that handlers must return. - Depending on the implementation they can either: - point to existing store (and expected to `use_agent_store` as true - point 1). Client code will rely on `TORCHELASTIC_USE_AGENT_STORE` env variable to know if the store is shared. - build args that `torch.distributed.init_process_group` can bootstrap by creating new store. Additional points: - When TCPStore is shared, it should be wrapped in PrefixStore to qualify/scope namespace for other usecases. - `next_rendezvous` signature changed to return instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple for extensibility purposes. Why: - Reduce moving parts - easier to swap implementation - improve tractability - addressing perf/debug-ability will benefit all usecases - Test Plan: CI Differential Revision: D57055235 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743 Approved by: https://github.com/d4l3k	2024-05-22 18:24:11 +00:00
Ke Wen	403012b50a	[pipelining] expose APIs per pytorch rule (#126812 ) Rule is enforced by #126103. The rule: - If `torch.a.b` defines a public class `C` (i.e. to be exposed in torch API namespace), then `torch.a.b` must be a public path, i.e. no `_`. - `torch.a.b` should ideally have an `__all__` that defines what should be imported from this file when it is imported. - All other definitions in `torch.a.b` that you don't want to expose should have a `_` prefix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126812 Approved by: https://github.com/wconstab	2024-05-22 16:21:13 +00:00
Sahdev Zala	fe0a36fd7c	Fix a link in the compiler backend doc (#126079 ) The core aten is the core subset of aten and seems the corrent link to replace the broken link. Fixes #125961 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126079 Approved by: https://github.com/svekars	2024-05-21 20:16:04 +00:00
Joel Schlosser	31ba6ee49b	Traceable wrapper subclass support for deferred runtime asserts (#126198 ) The padded dense -> jagged conversion op has the signature: ``` _fbgemm_dense_to_jagged_forward(Tensor dense, Tensor[] offsets, SymInt? total_L=None) -> Tensor ``` when `total_L` is not specified, the meta registration has a data-dependent output shape (based on `offsets[0][-1]`). Returning an unbacked SymInt here should work in theory, but traceable wrapper subclass support is missing in later code to handle deferred runtime asserts. This PR fixes this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126198 Approved by: https://github.com/ezyang	2024-05-21 01:21:46 +00:00
Mikayla Gawarecki	66dc8fb7ff	Allow tensor subclasses and add `torch.serialization.add_safe_globals` that allows users to allowlist classes for `weights_only` load (#124331 ) #### Conditions for allowlisting tensor subclasses We allow tensor subclasses types that (1) Do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__` so we check that these are `None`) (2) Use the generic `tp_alloc` (3) Are in a module that has been imported by the user to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2` Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function as this method claims to have no code execution. The rationale for the 3 conditions above is as follows: The rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__` and the call to `as_subclass` as well as the call to `_set_obj_state` which calls `setattr`) `4e66aaa010/torch/_tensor.py (L57-L71)` `as_subclass` is implemented with a call to `THPVariable_NewWithVar` that will eventually call `tp_alloc` here `4e66aaa010/torch/csrc/autograd/python_variable.cpp (L2053)` The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc` Note that we do not call `tp_init` or `tp_new` (i.e. `cls.__init__` or `cls.__new__`) when unpickling* ### How do we check something is a tensor subclass/constraints around imports In order to check whether `bla` is a tensor subclass in the bytecode `GLOBAL module.name`, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We do not arbitrarily import modules but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys[modules], name), torch.Tensor)` This PR also allowlisted `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`) ### API for allow listing This PR also added `torch.serialization.{add/get/clear}_safe_globals` that enables user to allowlist globals they have deemed safe and manipulate this list (for example they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that this is safe). Next steps: - Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor` etc.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124331 Approved by: https://github.com/albanD	2024-05-17 17:56:57 +00:00
yuanx749	691af57fbc	Fix broken link of scikit-learn (#120972 ) The link is broken in https://pytorch.org/docs/main/community/design.html Pull Request resolved: https://github.com/pytorch/pytorch/pull/120972 Approved by: https://github.com/Skylion007	2024-05-16 11:46:34 +00:00
Edward Z. Yang	44efeac24e	Beef up error message for pending assert failure (#126212 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/126212 Approved by: https://github.com/Skylion007	2024-05-15 18:22:53 +00:00
Oguz Ulgen	79655a1321	Add force_disable_caches to the docs (#126184 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/126184 Approved by: https://github.com/msaroufim	2024-05-15 07:16:08 +00:00

... 3 4 5 6 7 ...

2950 Commits