pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
cyy	d558c1a047	Enable cppcoreguidelines-special-member-functions (#139132 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139132 Approved by: https://github.com/sraikund16	2024-11-06 13:42:20 +00:00
Nikita Shulga	c0c6bf4ef2	Don't use deprecated type properties in UpsampleKernel (#139399 ) By replacing `at::CPU(dtype)` pattern with `at::device(kCPU).dtype(dtype)` pattern Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399 Approved by: https://github.com/Skylion007 ghstack dependencies: #139353	2024-11-06 13:34:45 +00:00
PyTorch MergeBot	44e4949bcf	Revert "[Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595 )" This reverts commit `22e89ea2aa`. Reverted https://github.com/pytorch/pytorch/pull/139595 on behalf of https://github.com/malfet due to It broke number of tests, see `22e89ea2aa` ([comment](https://github.com/pytorch/pytorch/pull/139595#issuecomment-2459754355))	2024-11-06 13:31:26 +00:00
PyTorch MergeBot	10d7729333	Revert "Enable cppcoreguidelines-special-member-functions (#139132 )" This reverts commit `a9b4989c72`. Reverted https://github.com/pytorch/pytorch/pull/139132 on behalf of https://github.com/ZainRizvi due to Sorry but this fails on trunk. See inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_smooth_quant_with_int_mm [GH job link](https://github.com/pytorch/pytorch/actions/runs/11699366379/job/32591132460) [HUD commit link](`22e89ea2aa`) ([comment](https://github.com/pytorch/pytorch/pull/139132#issuecomment-2459743145))	2024-11-06 13:27:42 +00:00
PyTorch MergeBot	53299b8a38	Revert "Don't use deprecated type properties in UpsampleKernel (#139399 )" This reverts commit `0058f71002`. Reverted https://github.com/pytorch/pytorch/pull/139399 on behalf of https://github.com/malfet due to And it was backed out again due to the internal usages of deprecated API ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2459740090))	2024-11-06 13:23:43 +00:00
Jack Taylor	5f266b5a02	[ROCm] re-enable flex attention UTs (#139632 ) https://github.com/pytorch/pytorch/pull/136792 accidentally disabled flex attention UTs on ROCm. Re-enabling. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139632 Approved by: https://github.com/drisspg	2024-11-06 12:49:44 +00:00
Michael Lazos	d622b490d6	[Dynamo] Support tensor mro without source (#139838 ) Fixes https://github.com/pytorch/pytorch/issues/137743 The issue here is that if `type` was called on a tensor without a source, we wouldn't have a source even for `torch.Tensor`, and the `__mro__` retrieval would fail. Since `torch.Tensor` is an internal torch type, I add handling for it in `call_type` in builtins. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139838 Approved by: https://github.com/williamwen42	2024-11-06 08:52:53 +00:00
cyy	a9b4989c72	Enable cppcoreguidelines-special-member-functions (#139132 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139132 Approved by: https://github.com/sraikund16	2024-11-06 07:59:09 +00:00
Xia, Weiwen	22e89ea2aa	[Inductor][CPU] Fuse SmoothQuant int8 linear pattern (#139595 ) About the PR In the implementation of SmoothQuant in Torchao, quantized linear is computed by `_int_mm(a, b)` + `mul(b_scale)` + `mul(a_scale)` (+ optional `add` for bias) with `reshape` and `convert_dtype` in between. This PR adds a pass to fuse the corresponding patterns: - (no bias) `reshape -> _int_mm -> convert_element_type -> (expand -> mul) -> mul -> reshape` - (with bias) `pattern_no_bias -> add -> reshape -> reshape` The patterns are replaced by `onednn.qlinear_pointwise` and `onednn.qlinear_prepack`, the latter of which is evaluated and frozen during the freezing process of Inductor. The final graph contains `onednn.qlinear_pointwise` only with packed weight constants. Note that `onednn.qlinear_pointwise` does not support per-channel quantization of activation, which is a limitation of oneDNN library, so in that case we set activation scale to 1 and bias to none and apply scales and add bias after `onednn.qlinear_pointwise`. Validation results Accuracy/perplexity is not changed with or without this fusion pass. Latency is improved by >10% with the fusion pass. Test method: - Model: EleutherAI/gpt-j-6b - Hardware: Intel(R) Xeon(R) Platinum 8490H, running on 1 socket, 60 cores - Using Intel OMP and Tcmalloc - Running [the example script of SmoothQuant in Torchao](https://github.com/pytorch/ao/blob/main/torchao/prototype/smoothquant/example.py) with `TORCHINDUCTOR_FREEZING=1 numactl -N1 python example.py -m EleutherAI/gpt-j-6b --device=cpu --quant-mode=dynamic --compile` Test plan ``` python test/inductor/test_mkldnn_pattern_matcher.py -k test_smooth_quant_with_int_mm ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139595 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jerryzh168	2024-11-06 07:54:47 +00:00
Huy Do	c19c384690	Fix torch.load (torch.utils.benchmark) after #137602 (#139810 ) After #137602, the default `weights_only` has been set to True. This test is failing in trunk slow jobs atm benchmark_utils/test_benchmark_utils.py::TestBenchmarkUtils::test_collect_callgrind [GH job link](https://github.com/pytorch/pytorch/actions/runs/11672436111/job/32502454946) [HUD commit link](`1aa71be56c`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139810 Approved by: https://github.com/kit1980	2024-11-06 03:08:29 +00:00
Colin Peppler	63b01f328e	[inductor] support masked_scatter w/ unbacked sized source (#138083 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138083 Approved by: https://github.com/jansel	2024-11-06 02:16:25 +00:00
cyy	028c5d3426	[2/N] Replace c10::sv with std::sv (#139456 ) Follows #139453 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139456 Approved by: https://github.com/ezyang	2024-11-06 01:50:38 +00:00
Andrew Gu	39ede99a33	Add current FSDP2 path to old composable FSDP1 warning (#139759 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139759 Approved by: https://github.com/weifengpy, https://github.com/wz337 ghstack dependencies: #139650	2024-11-06 01:43:04 +00:00
David Berard	aec179e2be	Fix docs for logcumsumexp formula (#139768 ) The previous formula was wrong and reused some indexing variables. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139768 Approved by: https://github.com/janeyx99	2024-11-06 01:19:09 +00:00
Laith Sakka	a787320d0f	Do not try to optimize new implications in get_implications (#139738 ) Summary: Saves around 8% on the torchrec model. In most cases the new implications are not optimized anyway; in some cases they are, but optimizing them is useless. Example: ``` generating implications for Eq(Mod(s0, 3), 0) adding Eq(Mod(s0, 3), 0) adding Eq(0, Mod(s0, 3)) adding Ne(Mod(s0, 3), 0) adding Ne(0, Mod(s0, 3)) adding Mod(s0, 3) <= 0 adding 0 < Mod(s0, 3) adding True adding False ``` vs. ``` generating implications for Eq(Mod(s0, 3), 0) adding Eq(Mod(s0, 3), 0) adding Eq(0, Mod(s0, 3)) adding Ne(Mod(s0, 3), 0) adding Ne(0, Mod(s0, 3)) adding Mod(s0, 3) <= 0 adding 0 < Mod(s0, 3) adding 0 <= Mod(s0, 3) adding Mod(s0, 3) < 0 ``` The main difference is that `0 <= Mod(s0, 3)` can be simplified to `True` and `Mod(s0, 3) < 0` to `False`, but with this change that no longer happens. The `True: True` and `False: False` entries are useless anyway, so this should be fine. ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139738 Approved by: https://github.com/ezyang ghstack dependencies: #139703	2024-11-06 00:23:40 +00:00
Will Feng	6a30c14a0a	[Traceable FSDP2] Run any unexecuted post_backward at beginning of pre_backward hook (#139671 ) Assuming the forward pass user code looks like: ``` for _ in range(2): x = layer(x) ``` and we have `fully_shard(layer)`, then: - the forward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time -> reshard layer" (currently the same for both eager and compile) - the backward pass will be like: "unshard layer -> call layer 1st time -> reshard layer -> unshard layer -> call layer 2nd time -> reshard layer" in eager, but currently it's "unshard layer -> call layer 1st time -> call layer 2nd time -> reshard layer" in compile The behavior in the backward pass is different between eager and compile, which is not ideal. I am currently looking for a way to fix this non-ideal behavior of compile and have tried a few things: 1. Tracing the RegisterPostBackwardFunction custom autograd function - this still seems to be a no-go, due to HOP not supporting side effects. 2. Instead of a custom autograd function, do a "multi-grad hook" to wait for all gradients to be ready before triggering post_backward. However, this approach seems to interact badly with the register_hook of pre_backward, in the sense that it's unclear which of them will be triggered first in practice. 3. Force execute any pending post_backward before unshard in the pre_backward hook, and rely on the compiler to move the reshard to the right place to optimize peak memory. -> This PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/139671 Approved by: https://github.com/awgu	2024-11-06 00:19:06 +00:00
Xiaodong Wang	e7cf7d00be	Support torch.bool in torch.sort + CUDA (#139409 ) Summary: This might be outdated, so I'm adding it back to see if we pass all the tests. I'm pretty sure cuda12 is ok. Test Plan: CI Differential Revision: D65282650 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139409 Approved by: https://github.com/zou3519, https://github.com/ngimel, https://github.com/eqy	2024-11-06 00:02:54 +00:00
Aaron Orenstein	06f619d999	typing ir.py - part 2 (#131846 ) See #131852 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131846 Approved by: https://github.com/eellison ghstack dependencies: #139238	2024-11-06 00:01:15 +00:00
Aaron Orenstein	c2109ec479	typing ir.py - Disallow untyped defs for ir.py (#139238 ) - Remove "mypy: allow-untyped-defs" and mark functions individually with "no-untyped-def" - Mark some trivial functions with the proper return types (`None` and `torch.dtype`) - Fixed a type bug in the signature of supported_dtype_of_cpp_wrapper() - `ruff check torch/_inductor/ir.py --select ANN --fix --unsafe-fixes` and then fixed up things that looked incorrectly applied. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139238 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2024-11-06 00:01:15 +00:00
leslie-fang-intel	82e4de4994	[Inductor][CPU] Enable the oneDNN Linear fusion for special case (#139172 ) Summary In the case of LLaMA2, a linear operation with an activation size of `(4, 1, 4096)` and a stride of `(4096, 128, 1)` is decomposed into `matmul`, and the decomposition of `matmul` results in `bmm` due to a strict contiguity check. We can align the contiguity check with ATen by skipping dims of size 1 to enable decomposition into `mm` instead. Test Plan ``` python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_input_non_contiguous_3D_wo_bias ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139172 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-11-05 23:49:53 +00:00
Thomas Bohnstingl	d1c26b0781	Improvements for associative_scan - slicing of xs (#138858 ) In this PR, the combine_fn is consistently called with a slice along the scan dim. It implements part of https://github.com/pytorch/pytorch/pull/136966 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138858 Approved by: https://github.com/ydwu4	2024-11-05 23:38:21 +00:00
Mikayla Gawarecki	86d7d39bff	Forward fix D65441551 for T206731737 (#139767 ) Test Plan: - Differential Revision: D65482429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139767 Approved by: https://github.com/awgu	2024-11-05 23:19:08 +00:00
Shuqiang Zhang	c0d642a295	[pgnccl][simple] log started work numel (#139773 ) Summary: We saw some cases where the same work was started on multiple ranks but did not complete. Logging the numel gives us more information, such as whether the numel matches. Test Plan: CI Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/139773 Approved by: https://github.com/Skylion007, https://github.com/kwen2501	2024-11-05 23:11:19 +00:00
PyTorch MergeBot	1d28b8b6d5	Revert "Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 )" This reverts commit `e84d1121ad`. Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally. More details in D65483292 ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2458381056))	2024-11-05 23:10:38 +00:00
drisspg	16da289402	[Workspace Inductor] Fix dynamic shapes (#139777 ) # Summary Arg ordering was wrong for when dynamic shapes is enabled and we pass in the additional size args Pull Request resolved: https://github.com/pytorch/pytorch/pull/139777 Approved by: https://github.com/eellison ghstack dependencies: #139157	2024-11-05 22:34:09 +00:00
Animesh Jain	b09eb6ed6a	[dynamo][guards] Consider tensors as immutable for dict tag matches (#139560 ) This is a bug on main exposed by https://github.com/pytorch/pytorch/issues/139476 We have a dict tag optimization where, if the dict tag does not change, we skip guards on all the items of the dict that are "immutable". We considered tensors as immutable in such scenarios. This is critical for guard eval performance, because users generally don't change their parameters. If I try to remove this optimization, we see slowdowns, e.g., 3.03x to 2.95x on the conv_mixer TIMM benchmark. So, I am adding a flag which keeps the current state but allows users to remove this optimization. Not ideal, but given how serious guard eval perf has to be, we are in the gray area of the unsoundness vs. performance tradeoff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560 Approved by: https://github.com/jansel	2024-11-05 21:48:07 +00:00
Yidi Wu	6734cb7bf2	[hop free symbols] refactor tensor.to_list implementation to call wrap_fx_proxy. (#139663 ) Refactoring only. Previously, we manually called SymNodeVariable.create; now we handle it with wrap_fx_proxy. This unifies the handling of operations that produce symints in wrap_fx_proxy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139663 Approved by: https://github.com/zou3519 ghstack dependencies: #138345, #138428, #138558, #138737, #138559	2024-11-05 20:19:09 +00:00
rzou	b9f0563aaf	Add repro instructions to fx_graph_runnable.py (#139481 ) This PR adds some instructions for how to add a TARGETS file to run the fx_graph_runnable script. I'm planning to add some followups that will add additional imports for custom ops and use autodeps to get the dependencies, but I figure this PR is an easy first step. Test Plan: - pytest test/dynamo/test_structured_trace.py - Does anyone have suggestions for how to test this? Pull Request resolved: https://github.com/pytorch/pytorch/pull/139481 Approved by: https://github.com/eellison	2024-11-05 19:24:16 +00:00
Ryan Guo	01bcf37123	[dynamo][NFC] Remove some dead code paths (#139674 ) As title. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139674 Approved by: https://github.com/Skylion007, https://github.com/anijain2305, https://github.com/mlazos	2024-11-05 19:12:17 +00:00
Ryan Guo	2b3a227b35	[dynamo] Add `is_mutable()` and `is_immutable()` methods to `VariableTracker` (#139341 ) This patch adds 2 simple methods `VariableTracker.is_mutable()` and `VariableTracker.is_immutable()`, which helps clarify intention. For instance, rather than writing ```python if var.mutation_type: ... ``` After this patch one can write ```python if var.is_mutable(): ... ``` This patch also simplifies `mutation_type` propagation in some `ListVariable` methods. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139341 Approved by: https://github.com/mlazos, https://github.com/anijain2305 ghstack dependencies: #139339, #139340	2024-11-05 19:11:41 +00:00
Ryan Guo	0ba3962b80	[dynamo][NFC] Move `MutationType` classes into `variables/base.py` (#139340 ) As title, this addresses https://github.com/pytorch/pytorch/pull/137905/files#r1806800222. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139340 Approved by: https://github.com/anijain2305 ghstack dependencies: #139339	2024-11-05 19:11:41 +00:00
Ryan Guo	693a0a1bd4	[dynamo][NFC] Rename `mutable_local` and add documentation (#139339 ) This patch addresses the renaming part of #133027. Specifically, it renames the following and adds documentation for relevant classes. 1. `VariableTracker.mutable_local` to `mutation_type` 2. `MutableLocal` to `ValueMutationNew` 3. `MutableSideEffects` to `ValueMutationExisting` 4. `MutableLocalSource` to `SourceType` 5. `MutableLocalSource.Local` to `New` Note that (2), (3) and (5) are mainly to bring consistency between them and `AttributeMutationNew`, `AttributeMutationExisting`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139339 Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305	2024-11-05 19:11:41 +00:00
Ke Wen	5f2ed505eb	[PGNCCL] Watchdog prints call-time traceback when reporting timeout (#139659 ) ### Motivation Today, watchdog only reports that it found a collective timeout: ``` [rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out. ``` While this is nice, it is hard to associate the error with user's program or library stack. ### This PR This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior. The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito ). In `ProcessGroupNCCL`, we are only tracking / reporting the python part so that it fits most PyTorch users. ### Demo [stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09). ``` TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py ``` `TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder. 
Output: ``` [rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: #0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696 #1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83 #2 bar from /data/users/kw2501/sync_async/repro.py:15 #3 foo from /data/users/kw2501/sync_async/repro.py:24 #4 main from /data/users/kw2501/sync_async/repro.py:34 #5 <module> from /data/users/kw2501/sync_async/repro.py:40 [rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: #0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630 #1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83 #2 baz from /data/users/kw2501/sync_async/repro.py:20 #3 foo from /data/users/kw2501/sync_async/repro.py:26 #4 main from /data/users/kw2501/sync_async/repro.py:34 #5 <module> from /data/users/kw2501/sync_async/repro.py:40 ``` From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks diverge. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139659 Approved by: https://github.com/wconstab, https://github.com/fduwjj	2024-11-05 19:07:17 +00:00
Yifu Wang	ee42a99745	[SymmetricMemory] introduce a binding for cuMemset32Async (#138755 ) ## This Stack This stack does the following things to support `xformers`-style, comm-aware Triton kernels: - Exposes `signal_pad`s as tensors in Python - Adds a binding for `cuMemsetAsync` These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns. ## This PR Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility. To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`: - `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`. - `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755 Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw	2024-11-05 18:47:24 +00:00
Boyuan Feng	87059d4547	[AOTAutograd] Handle edge cases for donated buffer & enable in oss (#139669 ) This PR enables donated buffer in OSS and handles two edge cases: 1. While donated buffer relies on storage to check aliasing, sparse tensor subclasses do not provide access to storage. So we skip sparse tensor subclasses for donated buffer. 2. Handles missing "val" from n.meta. This is observed in `inductor/test_fused_attention.py::SDPAPatternRewriterCpuTests::test_sdpa_rewriter_11_cpu`, `functorch/test_aotdispatch.py::TestAOTAutograd::test_input_mutation_simple_with_none_and_nontensor`, and `inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139669 Approved by: https://github.com/bdhirsh	2024-11-05 18:38:20 +00:00
rzou	27ec3921bc	Optimize mutable torch.library.custom_op overhead (#139513 ) We don't need to do a loop over all the args, kwargs in the AdInplaceOrView key; we just need to bump the version on the args, kwargs that are mutable. On the benchmark mentioned in https://github.com/pytorch/pytorch/issues/139494 this made the time go from ``` mutate2 = 61.72943878173828 no_mutate2 = 36.89440155029297 mutate = 236.3092498779297 no_mutate = 59.31964874267578 ``` to ``` mutate2 = 47.976478576660156 no_mutate2 = 38.37468719482422 mutate = 71.21315002441406 no_mutate = 59.7432975769043 ``` Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/139513 Approved by: https://github.com/bdhirsh ghstack dependencies: #139509	2024-11-05 18:30:53 +00:00
Tomasz Bohutyn	9dc5851f5d	handle more devices in method_type method of TensorVariable (#138078 ) Fixes #138077 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138078 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-11-05 18:19:52 +00:00
Angela Yi	de509abe1c	[export] Dedup data-dependent errors based on stacktrace (#139540 ) Summary: Dedup the data-dependent errors based on the stacktrace they point to. Right now we just display every propagate-real-tensor log that shows up, but we can actually dedup them if they are due to the same piece of code (e.g. there could be multiple calls to a piece of code that does some data-dependent computation). This occurred when trying out draft export on the PT2I model zoo. For a specific model, previously we would get ~3k data-dependent errors, but after deduping based on the stacktrace we now only get 4 errors. Test Plan: CI Differential Revision: D65374254 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139540 Approved by: https://github.com/pianpwk, https://github.com/zou3519	2024-11-05 18:16:05 +00:00
Sam Ginzburg	cc25b6d7ba	[inductor] Error on unsupported autotuner configs (#139658 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139658 Approved by: https://github.com/aakhundov	2024-11-05 18:09:02 +00:00
Junjie Wang (PyTorch)	41e4d88584	[logging][ez] Add timer logging for pickling and unpickle for object based collective (#139757 ) Summary: As discussed, we want to measure the time spent during pickling and unpickle. Test Plan: CI Reviewed By: wz337 Differential Revision: D65462767 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139757 Approved by: https://github.com/awgu, https://github.com/Skylion007, https://github.com/fegin, https://github.com/c-p-i-o	2024-11-05 17:40:27 +00:00
Oguz Ulgen	c0d21b6581	End TritonBundle on non-cache write codepaths (#139698 ) Summary: When we bypass cache write on inductor, we were also forgetting to reset the bundle, this moves resetting the bundle into post_compile step so it gets uniformly reset. This diff also turns on the cache for internal so that we can do a code rollout. Test Plan: updated tests Differential Revision: D65457224 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139698 Approved by: https://github.com/ezyang	2024-11-05 17:00:40 +00:00
PyTorch MergeBot	4d5cc1b4ef	Revert "[dynamo][guards] Consider tensors as immutable for dict tag matches (#139560 )" This reverts commit `e6ff07f00e`. Reverted https://github.com/pytorch/pytorch/pull/139560 on behalf of https://github.com/ZainRizvi due to Sorry but this seems to be breaking internal tests. Please see D65430317 for more details ([comment](https://github.com/pytorch/pytorch/pull/139560#issuecomment-2457620720))	2024-11-05 16:22:30 +00:00
cyy	a2bc2e38f9	Use clang-tidy 17 (#139678 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139678 Approved by: https://github.com/Skylion007	2024-11-05 16:00:25 +00:00
Junjie Wang (PyTorch)	13eb3b3f6f	[Torch Elastic] Fix the bug caused by wrong host address in creating TCPStore server inside dynamic rendezvous (#139702 ) Summary: During dynamic rendezvous, we shouldn't use the address from the store but just use `self._this_node.addr` directly because sometimes, the store host is not the host of rank0. Passing wrong host will cause timeout error. This is a follow up fix to S463164, for internal tests, we disable the TCPStore sharing for now. Test Plan: CI. Differential Revision: D65453312 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139702 Approved by: https://github.com/XilunWu	2024-11-05 15:28:03 +00:00
Edward Z. Yang	349cd49406	Fix compiler collective TORCH_TRACE and improve code state printing (#139716 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/139716 Approved by: https://github.com/yf225	2024-11-05 14:32:52 +00:00
cyy	546318e559	[7/N] Don't skip ASAN on some tests (#139675 ) Follows #139565 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139675 Approved by: https://github.com/ezyang	2024-11-05 14:01:01 +00:00
Xuehai Pan	e84d1121ad	Deprecate `torch._utils.is_compiling()` and `torch._dynamo.external_utils.is_compiling()` (#127690 ) This PR is split from PR #126898. - #126898 ------ Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690 Approved by: https://github.com/Skylion007, https://github.com/malfet	2024-11-05 10:44:56 +00:00
zeshengzong	ffb7a08921	Fix torch.histc not checking min > max on cuda for int8 tensors (#139372 ) Fixes #139360 `86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L323-L324)` Assigning `min` and `max` to the low-precision `input_t` variables `minvalue` and `maxvalue` causes a wrong comparison result in the following check: `86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L353)` Change the type of `minvalue` and `maxvalue` to fix it, similar to: `86e6513c86/aten/src/ATen/native/cuda/SummaryOps.cu (L280-L282)` Test Result ```bash $ pytest test/test_reductions.py -vv ``` ```bash $ lintrunner ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139372 Approved by: https://github.com/eqy	2024-11-05 08:42:38 +00:00
Laith Sakka	6ad52db8c8	use torch.sym_sum instead of incremental sum in _cat_meta (#139653 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139653 Approved by: https://github.com/ezyang	2024-11-05 07:24:24 +00:00
Aaron Orenstein	51a3d6dbc3	Fix existing lint issues in ir.py (#139237 ) - Remove stale mypy "type: ignores" - Made ir.py pass the rest of the lints Pull Request resolved: https://github.com/pytorch/pytorch/pull/139237 Approved by: https://github.com/Skylion007	2024-11-05 06:06:12 +00:00
Eli Simhayev	b2f5a5311b	RMSNorms docs - remove biases initialization (#139620 ) RMSNorm doesn't use a bias in `elementwise_affine`, so I've removed it from the documentation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139620 Approved by: https://github.com/mikaylagawarecki	2024-11-05 05:59:41 +00:00
Chen, Zejun	9aaf3a04fa	[profiler][UT] instantiate profiler UTs for devices and enable UTs for xpu profiler (#134316 ) This PR enables the profiler related UT to be device-agnostic. It instantiates the profiler UTs for different device types and enable them on XPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134316 Approved by: https://github.com/etaf, https://github.com/aaronenyeshi, https://github.com/gujinghui	2024-11-05 05:46:13 +00:00
CaoE	9e14d86573	[Inductor][CPP] Add oneDNN BRGEMM config for Half cpp gemm template (#136255 ) `kernel_micro_gemm` generated using BRGEMM: ``` template <bool accum> inline void kernel_micro_gemm( const half* __restrict__ A, const half* __restrict__ B, float* __restrict__ C, int64_t M, int64_t N, int64_t K, int64_t lda, int64_t ldb, int64_t ldc ) { at::native::cpublas::brgemm( M, N, K, lda, ldb, ldc, 1.f, accum ? 1.f : 0.f, A, B, C); } ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136255 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-11-05 05:33:29 +00:00
Meet Vadakkanchery	c8a55eea88	[DCP] Fix process_group logging for DCP methods (#139428 ) Summary: Currently, we incorrectly log process_group for DCP based events. We rely on [c10d_logger.py](https://fburl.com/v4mdme9z) to fill in information about process_group (e.g. backend, nccl_version if available). In [checkpoint/logger.py](https://fburl.com/yho9nqbu) we pass the `msg_dict` to c10d_logger which never contains the `process_group` param, so [c10d_logger](https://fburl.com/zlw2ukxp) logs information about the default process_group which is always `NCCL`. Test Plan: Before: Always defaults to NCCL even though GLOO is passed by caller. {F1950847585} After: GLOO backend shows up. {F1950848375} Differential Revision: D65255871 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139428 Approved by: https://github.com/teja-rao, https://github.com/mhorowitz	2024-11-05 05:24:38 +00:00
Animesh Jain	fe4fa1df9f	[dynamo][eval_frame] Set the callback to None earlier for guard eval (#139655 ) xref - https://fb.workplace.com/groups/1075192433118967/permalink/1536570810314458/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/139655 Approved by: https://github.com/jansel, https://github.com/williamwen42	2024-11-05 05:18:46 +00:00
Gabriel Ferns	a766d84a3c	Allow inplacing buffer when other users are inconsequential (#138383 ) Summary: I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer. Implements: https://github.com/pytorch/pytorch/issues/132826 Test Plan: New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383 Approved by: https://github.com/eellison	2024-11-05 03:44:09 +00:00
Andrew Gu	9039fbb47e	[FSDP2] Make module-to-state mapping use weakrefs (#139650 ) Without this, `del model` does not free memory of a module with FSDP2 applied. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139650 Approved by: https://github.com/yf225	2024-11-05 02:16:52 +00:00
cyy	5008d15ae9	[2/N] Remove usage of C array (#139589 ) Follows #139567 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139589 Approved by: https://github.com/ezyang	2024-11-05 01:58:12 +00:00
CaoE	3672c688e3	Fix layout for SetSourceTensorKernel (#137973 ) Fixes #136837. `aten.set_.source_Tensor` will make the size and stride of the first input and output follow that of the second input: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/TensorShape.cpp#L440. If the layouts of the two inputs are different, the following `assert_size_stride` will fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137973 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-11-05 00:55:17 +00:00
Edward Yang	639162f39a	Add cache size to pt2_compile_events (#139627 ) Summary: I realized I wanted to check "are my cache entries/IO unreasonably large" and there's no easy way to do it. This lets me do it. Test Plan: servicelab Differential Revision: D65390363 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139627 Approved by: https://github.com/c00w	2024-11-05 00:30:10 +00:00
Nikita Shulga	0058f71002	Don't use deprecated type properties in UpsampleKernel (#139399 ) By replacing `at::CPU(dtype)` pattern with `at::device(kCPU).dtype(dtype)` pattern Pull Request resolved: https://github.com/pytorch/pytorch/pull/139399 Approved by: https://github.com/Skylion007 ghstack dependencies: #139353, #139358	2024-11-05 00:29:58 +00:00
PyTorch MergeBot	4a3ee96427	Revert "Don't use deprecated type properties in UpsampleKernel (#139399 )" This reverts commit `9d096e4d9f`. Reverted https://github.com/pytorch/pytorch/pull/139399 on behalf of https://github.com/ZainRizvi due to Change reverted internally due to broken builds. See D65378845 ([comment](https://github.com/pytorch/pytorch/pull/139358#issuecomment-2455959040))	2024-11-05 00:13:48 +00:00
cyy	64d9ee88d7	[11/N] Fix extra warnings brought by clang-tidy-17 (#139599 ) Follows #139385 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139599 Approved by: https://github.com/sraikund16	2024-11-04 23:57:41 +00:00
Laith Sakka	3f248a5735	Classify miss-inplaced tensors in logs. (#139240 ) Summary: use signpost logs, a followup is to remove the field possibly_missed_reinplacing_opportunities from dynamo compile table. Differential Revision: D65180194 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139240 Approved by: https://github.com/zou3519	2024-11-04 23:56:14 +00:00
Mikayla Gawarecki	e947649e8f	[BE] Change _marked_safe_globals_list to set (#139303 ) Prevent same global from being added multiple times Pull Request resolved: https://github.com/pytorch/pytorch/pull/139303 Approved by: https://github.com/janeyx99 ghstack dependencies: #138936, #139221, #139433, #139541, #137602	2024-11-04 23:50:55 +00:00
Pian Pawakapan	a678eaf1ad	check fake/real mismatches during real tensor prop (#137747 ) Summary: While testing exportability for PT2 Inference models, we found various cases of invalid op inputs during tracing, for example errors like: `a and b must have same reduction dim`, `expected scalar type Long but found Int`, etc. Looking more closely, these happened due to the same few meta kernels & eager kernels producing mismatched outputs upstream (e.g. different output tensor dtype, int output). Adding checks to catch mismatched outputs in real tensor prop upstream, so errors are raised at the mismatched op, instead of the downstream ops taking them as inputs. Relies a lot on utils from [CrossRefFakeMode](`929797dedb/torch/_subclasses/fake_utils.py (L78)`) Follow ups: could add more checks, and maybe have a flag to only enable these for cases like draft mode, so perf doesn't suffer? Test Plan: test_export, test_fake_tensor Differential Revision: D64210055 Pull Request resolved: https://github.com/pytorch/pytorch/pull/137747 Approved by: https://github.com/zou3519	2024-11-04 23:39:48 +00:00
Bob Ren	9919932783	Specialize symfloats that flow through is_integer (#139572 ) Fixes `python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_number_method_method_is_integer_num_type6_dynamic_shapes` when specialize_float = False Pull Request resolved: https://github.com/pytorch/pytorch/pull/139572 Approved by: https://github.com/ezyang ghstack dependencies: #139569, #139457, #139568	2024-11-04 23:35:35 +00:00
Henry Tsang	350bc2a166	[export] Add support for symbool to make it usable for torch.cond (#138765 ) # Why? I want the following code to work. minimal repro: ``` class M(torch.nn.Module): def forward(self, dilate_flag): return dilate_flag.item() input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),) model = M().cuda() ep = torch.export.export(model, input1, strict=True) path = torch._inductor.aot_compile(ep.module(), input1) aot_model = torch._export.aot_load(path, device="cuda") actual_output = aot_model(input1) ``` error: AssertionError: Encountered an unsupported object of type <class 'torch.SymBool'> while writing the metadata for exported program second error will be handled by https://github.com/pytorch/pytorch/pull/138760 # Motivation I could technically bypass it with a torch.int tensor. However, it doesn't work with torch.cond. I want the following to work. It would also require https://github.com/pytorch/pytorch/pull/138760 for aot compile to work. ``` class M(torch.nn.Module): def __init__(self) -> None: super().__init__() self.dilate_flag = 0 def forward(self, dilate_flag): self.dilate_flag = dilate_flag.item() def true_fn(dilate_flag): return dilate_flag.clone() def false_fn(dilate_flag): return dilate_flag.clone() torch.cond( self.dilate_flag, true_fn, false_fn, (dilate_flag,), ) return self.dilate_flag input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),) input2 = (torch.tensor([0], dtype=torch.bool, device="cuda"),) inputs = (input1, input2) model = M().cuda() for input in inputs: expected_output = model(input) ep = torch.export.export(model, input, strict=False) path = torch._inductor.aot_compile(ep.module(), input) aot_model = torch._export.aot_load(path, device="cuda") actual_output = aot_model(*input) assert ( expected_output == actual_output ), f"henry they are not equal {expected_output} != {actual_output}" ``` Differential Revision: D64867504 Pull Request resolved: https://github.com/pytorch/pytorch/pull/138765 Approved by: https://github.com/ydwu4	2024-11-04 23:31:49 +00:00
PyTorch MergeBot	6add86a29f	Revert "Tighten type hints for tensor arithmetic (#135392 )" This reverts commit `bf5cd8d011`. Reverted https://github.com/pytorch/pytorch/pull/135392 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking lint on trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/11673543178/job/32504499599) [HUD commit link](`bf5cd8d011`) ([comment](https://github.com/pytorch/pytorch/pull/135392#issuecomment-2455908056))	2024-11-04 23:30:15 +00:00
Jane Xu	23169a6bcc	Disable foreach tests for complex128 internally (#139649 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139649 Approved by: https://github.com/ngimel	2024-11-04 23:24:47 +00:00
Tugsbayasgalan Manlaibaatar	87a379b61b	Move pippy to training IR (#139233 ) Differential Revision: [D65282662](https://our.internmc.facebook.com/intern/diff/D65282662) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139233 Approved by: https://github.com/kwen2501 ghstack dependencies: #138658, #139209	2024-11-04 23:07:14 +00:00
Yidi Wu	397938b453	[hop free symbols][refactor] lift freevar to parent graph before lifting to subgraph (#138559 ) This refactoring is for getting a deterministic ordering of binding tensors and sizes of tensors. When seeing a free tensor x with shape (s0,) in subgraph, the ordering of lifting changes from ``` lift_x_in_child, lift_s0_in_child, lift_s0_in_parent, lift_x_in_parent ``` to ``` lift_x_in_parent, lift_s0_in_parent, lift_x_in_child, lift_s0_in_child ``` This produces a deterministic ordering of handling the symints in lifted tensors. This is also the current contract of dynamo top-level graph: we lift free_symbols in sizes after tensor x and insert the free symbols before the tensor x's proxy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138559 Approved by: https://github.com/zou3519 ghstack dependencies: #138345, #138428, #138558, #138737	2024-11-04 22:48:14 +00:00
Yidi Wu	c5b79699e1	[hop free symbols] replace ctx.save_for_backward to support symints/ints (#138737 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138737 Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/Chillee ghstack dependencies: #138345, #138428, #138558	2024-11-04 22:48:14 +00:00
Yidi Wu	ac20d0f893	[hop free symbols][refactor] make map's save_for_backward to handle int (#138558 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138558 Approved by: https://github.com/zou3519 ghstack dependencies: #138345, #138428	2024-11-04 22:48:07 +00:00
Yidi Wu	dc3a6a9d08	[hop free symbols][refactor] make create_graph_input always take example_value (#138428 ) Code refactoring only. We move the wrap_to_fake_tensor_logic out of wrap_fx_proxy for placeholders to provide the invariant that all graph inputs must set their example values when creating the inputs. This invariant helps us to identify all the free symbols in the graph in top-level and sub-graphs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138428 Approved by: https://github.com/ezyang, https://github.com/zou3519 ghstack dependencies: #138345	2024-11-04 22:47:49 +00:00
Yidi Wu	54c69a785b	[hop free symbols][refactor] make bound_symbols a dictionary (#138345 ) Code refactoring only. Change all self.tx.output.bound_symbols to self.tx.output.root_tracer.bound_symbols. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138345 Approved by: https://github.com/zou3519	2024-11-04 22:47:41 +00:00
Felix Zimmermann	bf5cd8d011	Tighten type hints for tensor arithmetic (#135392 ) Fixes #124015 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135392 Approved by: https://github.com/ezyang	2024-11-04 22:10:04 +00:00
Shunting Zhang	888110841c	[inductor] don't fuse two nodes if likely increase peak memory (#138756 ) Partially fixing https://github.com/pytorch/pytorch/issues/138685 Add a (relatively safe?) heuristic to skip fusion if it could potentially increase peak memory. The doc string mainly explains what this PR is doing: ``` The implementation is more like a heuristic since we don't really know if we are at peak or not when trying to fuse these two nodes. The order of nodes may change later which makes the peak memory estimation hard. Here is how we decide the LOWER BOUND of extra memory allocation if we fuse these 2 nodes: 1. find all buffers read by each node with a single user. These buffers are supposed to be reused if we don't fuse these 2 nodes 2. find the intersection of these buffers for the two nodes and sum the total buffer size. If we don't fuse these two nodes, we can at least avoid this much memory allocation. Note that the extra memory allocation is not necessarily causing peak memory increase. This is just a heuristic. We return true only if the saving for fusion can not trade off the extra memory allocation. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138756 Approved by: https://github.com/jansel ghstack dependencies: #139136	2024-11-04 20:49:29 +00:00
Ze Sheng	1aa71be56c	[PT2] Decouple decompose_triton_kernel_wrapper_functional from decompose_auto_functionalized (#139526 ) As title. We may not always want to remove the `triton_kernel_wrapper_functional` for example the references of [`unsafe_remove_auto_functionalized_pass`](`c8ab9b06a2/torch/export/_remove_auto_functionalized_pass.py (L48)`). Test Plan: CI & [D62592946](https://www.internalfb.com/diff/D62592946) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139526 Approved by: https://github.com/zou3519	2024-11-04 20:16:18 +00:00
Will Constable	71dc5df93c	[pipelining] Fix 'last backward' counting for dI / dW (#139415 ) Since any stage can run a mixture of full backwards and split backwards, it is important to count the sum of (full_backwards + backward_weight) when comparing to num microbatches to determine last backward. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139415 Approved by: https://github.com/H-Huang	2024-11-04 20:14:10 +00:00
Ryan Guo	30a83ca991	[dynamo] Improve codegen for `DataPtrVariable` and fix tensor reference issue (#139487 ) This addresses https://github.com/pytorch/pytorch/pull/137677/files#r1799836499, which had to set `allow_cache=False` for codegen on `DataPtrVariable.base`, which is a `TensorVariable`, otherwise we observe failure of `test_no_grad_copy` when testing with Dynamo. I've seen `test_no_grad_copy` failing a few times, and every single time it's related to cyclic reference, my best guess is the cyclic reference holds some tensor object longer in memory than necessary, preventing the optimization introduced in #11165. This patch makes `OutputGraph.cleanup()` more aggressive by clearing out all fields that might reference a `VariableTracker`. As a result, we can remove the aforementioned `allow_cache=False`, which helps generate better code (e.g., in the case of `test_no_grad_copy`, it skipped generating a redundant graph whose only op is returning the input tensor; instead we just generate a single `LOAD_FAST`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/139487 Approved by: https://github.com/jansel, https://github.com/aakhundov	2024-11-04 19:14:06 +00:00
Bin Bao	740054ffe6	[AOTI][reland] Switch OSS dashboard to use aoti_compile_and_package (#139597 ) Summary: Reland https://github.com/pytorch/pytorch/pull/139154 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139597 Approved by: https://github.com/angelayi	2024-11-04 18:53:17 +00:00
Oguz Ulgen	e76ce20177	Log to pt2 compile events (#139601 ) Summary: This option was added after I wrote the original diff, lets publish to pt2_compile_events Test Plan: CI Differential Revision: D65404910 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139601 Approved by: https://github.com/jamesjwu	2024-11-04 18:39:06 +00:00
Shunting Zhang	4930c4b716	[inductor] patterns to remove pointless view/permute pairs (#139136 ) These are not artificial patterns I came up with. They show up in the linear+CrossEntropyLoss graph. Consider this snippet: ``` class LinearAndCEL(nn.Module): def __init__(self): super().__init__() self.linear = nn.Linear(C, V) self.ce = nn.CrossEntropyLoss() def forward(self, x, y): return self.ce(self.linear(x).view(B * T, V), y.view(-1)) ``` `x` passed to `forward` is a 3D tensor of shape [B, T, C]. The `self.linear` will view x as [BxT, C] shape tensor first, do the matmul and produce a [BxT, V] tensor, and then view this output back to a 3D tensor with shape [B, T, V]. User code is gonna add another view op to convert the tensor shape to [B x T, V]. This generates a pair of redundant views. A pair of redundant permutes happens in the backward part when we compute gradients. The view ops make it hard to chunk linear+CEL. When the view op breaks up the dimension being chunked, what should the chunker do (even if we merge those dimensions again later)? Removing these pointless view pairs makes the chunker simpler. And I think it's in general nice to do. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139136 Approved by: https://github.com/Chillee, https://github.com/jansel	2024-11-04 18:39:02 +00:00
Mikayla Gawarecki	ca43ecd599	Flip default on weights_only (#137602 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137602 Approved by: https://github.com/malfet, https://github.com/albanD ghstack dependencies: #138936, #139221, #139433, #139541	2024-11-04 18:30:29 +00:00
Mikayla Gawarecki	f55dfbcf87	Remove hasattr(__slots__) for BUILD logic in weights_only unpickler (#139541 ) This is tested in PR stacked above in ```python python test/distributed/fsdp/test_fsdp_state_dict.py TestFSDPStateDict.test_torch_save_load ``` We cannot depend on whether `hasattr(..., __slots__)` to know whether a BUILD instruction has slotstate. For example, if a class subclasses ABC `hasattr(__slots__)` will be `True` but there might be no slots (and hence `state` will not be a tuple). So revert #138936 to following the pickle library's code ```python >>> from abc import ABC >>> hasattr(ABC, "__slots__") True ``` So ```python import torch from abc import ABC from dataclasses import dataclass class Foo(ABC): pass class FooWrapper(Foo): def __init__(self, x, y): self.x = x self.y = y f = FooWrapper(1, 2) torch.save(f, "temp.pt") with torch.serialization.safe_globals([FooWrapper]): torch.load("temp.pt") ``` Would fail on the previous code with ``` File "/data/users/mg1998/pytorch/torch/serialization.py", line 1934, in _load result = unpickler.load() File "/data/users/mg1998/pytorch/torch/_weights_only_unpickler.py", line 366, in load for k, v in slotstate.items(): ``` As there is actually no slotstate Pull Request resolved: https://github.com/pytorch/pytorch/pull/139541 Approved by: https://github.com/malfet ghstack dependencies: #138936, #139221, #139433	2024-11-04 18:30:29 +00:00
Tugsbayasgalan Manlaibaatar	ae0e7042f6	Fix custom obj being input (#139209 ) Differential Revision: [D65158939](https://our.internmc.facebook.com/intern/diff/D65158939) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139209 Approved by: https://github.com/ydwu4 ghstack dependencies: #138658	2024-11-04 18:24:29 +00:00
rzou	85c3c4132d	no-op torch.library.custom_op APIs on torch.deploy (#139509 ) We forgot this case in the previous PR. Fixes https://github.com/pytorch/pytorch/issues/137536 Test Plan: - better tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/139509 Approved by: https://github.com/williamwen42	2024-11-04 18:01:08 +00:00
PyTorch MergeBot	6dada2136a	Revert "Refactor FxGraphDrawer to use HTML-like labels (#137726 )" This reverts commit `1e73842029`. Reverted https://github.com/pytorch/pytorch/pull/137726 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it looks like some internal components are failing after this change and need to be updated ([comment](https://github.com/pytorch/pytorch/pull/137726#issuecomment-2455332612))	2024-11-04 17:44:44 +00:00
Tugsbayasgalan Manlaibaatar	e080c89bdc	Make test_torchbind.py training IR compatible (#138658 ) In this diff, I make the test_torchbind.py tests handle training IR. Today in the training IR, we don't see the effect token and HOP because this happens at the FunctionalTensorMode. Maybe in the future, we should move this logic up to the training IR so that writing passes etc. on training IR is safer. But for migration purposes, I think it is ok for now. I also fixed two bugs: 1. ep.module() doesn't register all aliased constants in the module. 2. When we retrace, we need to fakify the original Torchbind object. 3. We don't run any DCE on training IR so we need to add some more torch ops to verifier. Differential Revision: [D64853530](https://our.internmc.facebook.com/intern/diff/D64853530) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138658 Approved by: https://github.com/ydwu4, https://github.com/zhxchen17	2024-11-04 17:43:11 +00:00
Bob Ren	68c515b292	don't run z3 analysis on backed symfloat nodes (#139568 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139568 Approved by: https://github.com/ezyang ghstack dependencies: #139569, #139457	2024-11-04 17:04:29 +00:00
PyTorch MergeBot	3ca794783f	Revert "[SymmetricMemory] introduce a binding for cuMemset32Async (#138755 )" This reverts commit `924e726c3a`. Reverted https://github.com/pytorch/pytorch/pull/138755 on behalf of https://github.com/ZainRizvi due to Sorry but this breaks internally. Can you please fix this PR so it works internally and re-merge it? See D65401876 for more details ([comment](https://github.com/pytorch/pytorch/pull/138755#issuecomment-2455173596))	2024-11-04 16:34:34 +00:00
Bob Ren	87404b6ca6	support symfloats in translation validation (#139457 ) fixes `python test/dynamo/test_dynamic_shapes.py DynamicShapesHigherOrderOpTests.test_cond_pytree_operands_with_non_tensor_leaves_dynamic_shapes` when `specialize_float=False` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139457 Approved by: https://github.com/ezyang ghstack dependencies: #139569	2024-11-04 15:40:08 +00:00
Richard Barnes	6b8e3022f2	Remove c10::optional usages in PyTorch (#139525 ) Test Plan: Sandcastle Reviewed By: swolchok Pull Request resolved: https://github.com/pytorch/pytorch/pull/139525 Approved by: https://github.com/malfet, https://github.com/Skylion007	2024-11-04 15:35:23 +00:00
cyy	419a7e197d	[6/N] Fix Wextra-semi warning (#139605 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139605 Approved by: https://github.com/ezyang	2024-11-04 13:43:16 +00:00
Bob Ren	12d225d91c	add opaque unary sin and cos to SYMPY_INTERP (#139569 ) Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nn.py TestNNDeviceTypeCPU.test_affine_3d_rotateRandom_cpu` when specialize_float = False Pull Request resolved: https://github.com/pytorch/pytorch/pull/139569 Approved by: https://github.com/ezyang	2024-11-04 07:37:11 +00:00
Sun, Jiayi	3337439dc0	[inductor] modify the heuristic for disabling vectorization (#136422 ) Summary Since we have already implemented tail loop mask vectorization (https://github.com/pytorch/pytorch/pull/126526), I re-tuned the heuristics for disabling vectorization from performance perspective. I changed the heuristic to: when the total number of elements along the vec dim is less than `tiling_factor/4` and the number of operations is less than 10, we disable the vectorization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136422 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel	2024-11-04 07:33:32 +00:00
James Wu	f4ee5a243d	Add PT2 Compile Events for triton and kernel compilation + load_by_key_path (#139402 ) Adds a few more dynamo_timed() to measure triton compilation and load_by_key_path times. In the case of async compilation with multiple threads, we'll generate a single `kernel_compile` event that occurs when waiting on all the parallel compiles to finish. In the case where async parallel compilation is disabled (or, compile threads are warming up), we'll generate a `triton_compile` event for each kernel. The `triton_compile` event is a bit questionable: do we need a row for each triton compile event? It might eat up our already low retention, so I might just remove that. Will discuss with @slarsen. Differential Revision: [D65215707](https://our.internmc.facebook.com/intern/diff/D65215707/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139402 Approved by: https://github.com/oulgen	2024-11-04 06:37:18 +00:00
cyy	3179eb15ae	[1/N] Remove usage of C array (#139567 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139567 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2024-11-04 04:52:46 +00:00
Yuxin Wu	cadc50e7e9	LOG(INFO) -> VLOG(2) in ProcessGroupNCCL (#130696 ) In the same spirit as https://github.com/pytorch/pytorch/pull/105695 Initialization and error handling logs are mostly kept. Routine logs are changed to VLOG. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130696 Approved by: https://github.com/kwen2501 Co-authored-by: Ke Wen <kw2501@fb.com>	2024-11-04 04:43:42 +00:00
Jason Ansel	ed30fa74ab	[inductor] sympy.Integer([01]) -> sympy.S.(Zero\|One) (#139523 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523 Approved by: https://github.com/ezyang ghstack dependencies: #139364, #139365, #139370, #139452	2024-11-04 04:28:40 +00:00
Jason Ansel	b6fb135c2c	[inductor] Simplify remove_kernel_local_buffers (#139452 ) I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452 Approved by: https://github.com/shunting314 ghstack dependencies: #139364, #139365, #139370	2024-11-04 04:28:40 +00:00
Jason Ansel	3d633f12ba	[inductor] Move remove_kernel_local_buffers to Kernel (#139370 ) This method mutates the kernel, so it fits better in that class. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139370 Approved by: https://github.com/shunting314 ghstack dependencies: #139364, #139365	2024-11-04 04:28:33 +00:00
Jason Ansel	66d5e2405d	[inductor] Remove Node.last_usage mutation (#139365 ) I can't figure out why this is needed. Let's see if tests fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139365 Approved by: https://github.com/shunting314 ghstack dependencies: #139364	2024-11-04 04:28:25 +00:00
Jason Ansel	d189f92eb1	[inductor] Remove SIMDKernel.last_usage (#139364 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364 Approved by: https://github.com/eellison, https://github.com/shunting314	2024-11-04 04:28:18 +00:00
Animesh Jain	e6ff07f00e	[dynamo][guards] Consider tensors as immutable for dict tag matches (#139560 ) This is a bug on main exposed by https://github.com/pytorch/pytorch/issues/139476 We have dict tag optimization where if the dict tag does not change, we skip guards on all the items of the dict that are "immutable". We considered tensors as immutable in such scenarios. This is critical for guard eval performance, because generally users don't change their parameters. If I try to remove this optimization, we see slowdowns, e.g., 3.03x to 2.95x on conv_mixer TIMM benchmark. So, I am adding a flag which keeps the current state but allows the users to remove this optimization. Not ideal, but given how serious guard eval perf has to be, we are in the gray area of the unsoundness vs. performance tradeoff. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139560 Approved by: https://github.com/jansel	2024-11-04 00:54:20 +00:00
cyy	7f387fa612	[10/N] Fix extra warnings brought by clang-tidy-17 (#139385 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139385 Approved by: https://github.com/Skylion007	2024-11-04 00:47:19 +00:00
briancoutinho	3242049daa	[profiler] Annotate triton kernels with kernel hash (#139531 ) As above, annotates triton kernel hash in the profile attributes. Added a new unit test in profiler to triton/dynamo events. Testplan: Running new unit test in CI Internal: buck2 run @mode/dev-nosan caffe2/test:profiler -- -r test_pt2_triton_attributes Running on an example, this is how the kernel hash file looks ``` { "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 1670242, "tid": 1670242, "ts": 2413669097354.058, "dur": 95.812, "args": { "External id": 3,"kernel_hash": "cqaokwf2bph4egogzevc22vluasiyuui4i54zpemp6knbsggfbuu", "grid": "grid(100,)", "Record function id": 0, "stream": 0, "Concrete Inputs": ["", "", "", "100"], "kernel_file": "/tmp/torchinductor_bcoutinho/qa/cqaokwf2bph4egogzevc22vluasiyuui4i54zpemp6knbsggfbuu.py", "kernel_backend": "triton", "Input type": ["float", "float", "float", "Scalar"], "Input Strides": [[10, 1], [10, 1], [10, 1], []], "Input Dims": [[10, 10], [10, 10], [10, 10], []], "Ev Idx": 2 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139531 Approved by: https://github.com/davidberard98	2024-11-03 23:19:35 +00:00
Yifu Wang	924e726c3a	[SymmetricMemory] introduce a binding for cuMemset32Async (#138755 ) ## This Stack This stack does the following things to support `xformers`-style, comm-aware Triton kernels: - Exposes `signal_pad`s as tensors in Python - Adds a binding for `cuMemsetAsync` These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns. ## This PR Make `cuMemset32Async` available via `_SymmetricMemory.memset32`. We chose `cuMemset32Async` over `cudaMemsetAsync` because it allows for `uint32_t`-wise memset. This provides users with better flexibility. To enable this, we also added the following cuda driver APIs in `c10::cuda::DriverAPI`: - `cuDevicePrimaryCtxRetain` - for obtaining the primary context of a device in the form of `CUcontext`. - `cuCtxGetCurrent`/`cuCtxSetCurrent` - for setting and restoring the context for cuda driver APIs such as `cuMemset32Async`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138755 Approved by: https://github.com/weifengpy, https://github.com/eqy, https://github.com/lw	2024-11-03 21:37:31 +00:00
Bob Ren	5d07651c72	only use hint_size in _smart_symbol_sort for size type symbols (#139571 ) Fixes `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_torch.py TestTorchDeviceTypeCPU.test_exponential_kstest_cpu_bfloat16` when specialize_float = False Pull Request resolved: https://github.com/pytorch/pytorch/pull/139571 Approved by: https://github.com/ezyang ghstack dependencies: #139451, #139482, #139484, #139486	2024-11-03 21:15:08 +00:00
leslie-fang-intel	d84a344410	[Inductor] Skip coordinate_descent_tuning for mm/bmm decomposition on CPU (#139537 ) Summary Fix issue: https://github.com/pytorch/pytorch/issues/138823, `coordinate_descent_tuning` doesn't benefit on CPU and prefer lowering `mm`/`bmm` into ATEN kernels or CPP GEMM Template. Test Plan ``` python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_cpp_coordinate_descent_tuning ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139537 Approved by: https://github.com/jansel	2024-11-03 10:10:29 +00:00
Edward Z. Yang	585dbfa583	Profile guided optimization for automatic_dynamic (#139001 ) Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR. This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001 Approved by: https://github.com/oulgen	2024-11-03 06:29:57 +00:00
Bob Ren	a1370259ba	always specialize float on export path (#139486 ) This is the next step in supporting dynamic float arguments in PT2: docs.google.com/document/d/1HswUSp9H6mg8Vg27mhRk8YzC9q_uf63b6wz-gwx65BQ/edit?pli=1#heading=h.xvyiqp8tuje6. To make this more incremental and tractable, we've decided to opt the export path out of this first phase of the rollout. Fixes python test/export/test_export.py TestExport.test_export_input_mutation_dynamic_shape when specialize_float=False Pull Request resolved: https://github.com/pytorch/pytorch/pull/139486 Approved by: https://github.com/ezyang ghstack dependencies: #139451, #139482, #139484	2024-11-03 04:47:12 +00:00
Bob Ren	25f243ff5d	Update tensorify pass to specialize symfloats we didn't tensorify away (#139564 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139564 Approved by: https://github.com/huydhn	2024-11-03 04:27:43 +00:00
PyTorch MergeBot	067d2a089d	Revert "Expose Storage _use_count API in Python (#139426 )" This reverts commit `e31136d07b`. Reverted https://github.com/pytorch/pytorch/pull/139426 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing some inductor job in trunk ([comment](https://github.com/pytorch/pytorch/pull/139426#issuecomment-2453269063))	2024-11-03 02:40:45 +00:00
Bob Ren	b8b60e0bc5	add is_integer to support example_value function whitelist (#139484 ) Fixes python test/dynamo/test_dynamic_shapes.py DynamicShapesFunctionTests.test_is_integer_dynamic_shapes when specialize_float=False Pull Request resolved: https://github.com/pytorch/pytorch/pull/139484 Approved by: https://github.com/ezyang ghstack dependencies: #139451, #139482	2024-11-03 02:01:38 +00:00
Ke Wen	f121eab018	[c10d] Remove dead Dynamo marker (#139545 ) Per discussion with @anijain2305, `dynamo_unsupported_distributed_c10d_ops` is not referenced anywhere. Removing this dead code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139545 Approved by: https://github.com/Skylion007	2024-11-03 00:40:26 +00:00
Yukio Siraichi	a3cb8ee38b	AOTAutograd: Make general `SymInt` hashable when merging view inputs. (#139553 ) Fix: #139111 This PR wraps `SymInt` input arguments with `SymIntEqByExpr`, making them hashable when merging view inputs (`merge_view_inputs` function). Pull Request resolved: https://github.com/pytorch/pytorch/pull/139553 Approved by: https://github.com/ezyang	2024-11-02 23:57:11 +00:00
Yuanhao Ji	b46e1fc141	[Dynamo] Fix graph break when `tensor.split()` is called within a device context manager (#139270 ) Fixes: #139183 Note: this case can also be reproduced on cpu Pull Request resolved: https://github.com/pytorch/pytorch/pull/139270 Approved by: https://github.com/ezyang Co-authored-by: Vincent Moens <vincentmoens@gmail.com>	2024-11-02 23:55:51 +00:00
Jane Xu	e31136d07b	Expose Storage _use_count API in Python (#139426 ) Would be nice to replace the torch._C._storage_Use_Count call in https://github.com/pytorch/torchtune/pull/1936, at least without needing to know about _cdata in OSS code. Initially keeping it private as Tensor._use_count is also private. In favor over https://github.com/pytorch/pytorch/pull/139109 in solving the same problem, as exposing an existing API is better than adding a new one (and this enables a more robust fix) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139426 Approved by: https://github.com/soulitzer	2024-11-02 23:36:31 +00:00
Bob Ren	232af152b5	Fix graph breaks related to specialized float inputs (#139482 ) Fixes issue with timm models where example_value = 0.09999 proxy.node.target = <built-in function sub> would fall through to ``` unimplemented( "torch.* op returned non-Tensor " + f"{typestr(example_value)} {proxy.node.op} {proxy.node.target}", case_name="unsupported_operator", ) ``` and graph break Pull Request resolved: https://github.com/pytorch/pytorch/pull/139482 Approved by: https://github.com/ezyang ghstack dependencies: #139451	2024-11-02 21:58:46 +00:00
PyTorch MergeBot	854be65fa0	Revert "[PGNCCL] Make sure we do not use split for P2P comm creation (#139013 )" This reverts commit `55038aa661`. Reverted https://github.com/pytorch/pytorch/pull/139013 on behalf of https://github.com/kwen2501 due to More flavor of test_manual_with_data_parallel failed ([comment](https://github.com/pytorch/pytorch/pull/139013#issuecomment-2453085932))	2024-11-02 18:29:10 +00:00
PyTorch MergeBot	92d7f29e59	Revert "Profile guided optimization for automatic_dynamic (#139001 )" This reverts commit `f6be44c74e`. Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to more fbcode errors ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452985581))	2024-11-02 13:11:04 +00:00
Edward Z. Yang	f6be44c74e	Profile guided optimization for automatic_dynamic (#139001 ) Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR. This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001 Approved by: https://github.com/oulgen	2024-11-02 11:50:11 +00:00
Ke Wen	55038aa661	[PGNCCL] Make sure we do not use split for P2P comm creation (#139013 ) Resolve comment https://github.com/pytorch/pytorch/pull/138527#issuecomment-2438613172 There was a split-vs-P2P bug: When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` option that was meant for the previous PG creation. Then the P2P comm creation may use `ncclCommSplit` and hang, because not all ranks join this call. The bug has gone unnoticed until now because there is no CI test with the following recipe: eager init + new group + P2P in that new group. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139013 Approved by: https://github.com/shuqiangzhang	2024-11-02 07:47:55 +00:00
PyTorch MergeBot	2a3fe06ce0	Revert "[Partitioner] Enumerate partitions by iterating partition ids (#136598 )" This reverts commit `39ec5a20ea`. Reverted https://github.com/pytorch/pytorch/pull/136598 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails an executorch test https://github.com/pytorch/executorch/blob/main/exir/backend/test/test_graph_partition.py#L114-L175 ([comment](https://github.com/pytorch/pytorch/pull/136598#issuecomment-2452903705))	2024-11-02 07:19:22 +00:00
PyTorch MergeBot	f3238106fd	Revert "Allow inplacing buffer when other users are inconsequential (#138383 )" This reverts commit `030f70b40b`. Reverted https://github.com/pytorch/pytorch/pull/138383 on behalf of https://github.com/huydhn due to Sorry for reverting this again, but I think it has a test failing internally and also on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138383#issuecomment-2452898229))	2024-11-02 06:53:48 +00:00
PyTorch MergeBot	0863d6a08e	Revert "[inductor] Remove SIMDKernel.last_usage (#139364 )" This reverts commit `286d3ce266`. Reverted https://github.com/pytorch/pytorch/pull/139364 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:11 +00:00
PyTorch MergeBot	9331640e26	Revert "[inductor] Remove Node.last_usage mutation (#139365 )" This reverts commit `1e934b473c`. Reverted https://github.com/pytorch/pytorch/pull/139365 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
PyTorch MergeBot	dc4b459737	Revert "[inductor] Move remove_kernel_local_buffers to Kernel (#139370 )" This reverts commit `b57b4b7f9b`. Reverted https://github.com/pytorch/pytorch/pull/139370 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
PyTorch MergeBot	66a401c9e1	Revert "[inductor] Simplify remove_kernel_local_buffers (#139452 )" This reverts commit `73c0762a34`. Reverted https://github.com/pytorch/pytorch/pull/139452 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
PyTorch MergeBot	98e11b0021	Revert "[inductor] sympy.Integer([01]) -> sympy.S.(Zero\|One) (#139523 )" This reverts commit `c53beab377`. Reverted https://github.com/pytorch/pytorch/pull/139523 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))	2024-11-02 06:49:10 +00:00
Bob Ren	fdd298dcb7	add hex method on SymFloat (#139451 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139451 Approved by: https://github.com/ezyang	2024-11-02 05:33:19 +00:00
PyTorch MergeBot	8d1eaa3da6	Revert "Profile guided optimization for automatic_dynamic (#139001 )" This reverts commit `a6630bcf87`. Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to internal code triggers import cycle ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452833882))	2024-11-02 03:38:15 +00:00
drisspg	540f3ef9b1	Fix flex_decode to build offsets off of strides (#139516 ) Fixes PR: https://github.com/pytorch/pytorch/issues/139462 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139516 Approved by: https://github.com/Chillee	2024-11-02 03:17:46 +00:00
Bin Bao	a46a79fe92	[AOTI] Ignore .o files in package_aoti (#139153 ) Summary: There is no point to package .o files since a .so file is included in that package. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139153 Approved by: https://github.com/angelayi	2024-11-02 03:10:05 +00:00
Jason Ansel	c53beab377	[inductor] sympy.Integer([01]) -> sympy.S.(Zero\|One) (#139523 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523 Approved by: https://github.com/ezyang ghstack dependencies: #139364, #139365, #139370, #139452	2024-11-02 03:04:22 +00:00
Justin Chu	387b120549	[ONNX] Remove type promotion rule for pow (#139527 ) ONNX supports different input types in Pow, so type promotion is not needed. The resulting graph is the following: ```py ONNXProgram( model= < ir_version=9, opset_imports={'': 18, 'pkg.onnxscript.torch_lib.common': 1}, producer_name='pytorch', producer_version='2.6.0a0+git59a1af5', domain=None, model_version=None, > graph( name=main_graph, inputs=( %"x"<FLOAT16,[3]> ), outputs=( %"pow_1"<FLOAT16,[3]> ), ) { 0 \| # node_Constant_0 %"val_0"<?,?> ⬅️ ::Constant() {value=Tensor<FLOAT,[]>(array(2., dtype=float32), name=None)} 1 \| # node_Pow_1 %"pow_1"<FLOAT16,[3]> ⬅️ ::Pow(%"x", %"val_0") return %"pow_1"<FLOAT16,[3]> } ... , exported_program= ExportedProgram: class GraphModule(torch.nn.Module): def forward(self, x: "f16[3]"): # File: /workspace/pytorch/test/onnx/exporter/test_small_models_e2e.py:53 in forward, code: return x**2.0 pow_1: "f16[3]" = torch.ops.aten.pow.Tensor_Scalar(x, 2.0); x = None return (pow_1,) Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='pow_1'), target=None)]) Range constraints: {} ) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/139527 Approved by: https://github.com/titaiwangms	2024-11-02 02:19:50 +00:00
Chen, Zejun	edd3f5a94d	[profiler] fix a building warning by adding USE_KINETO namespace for setTraceID (#139461 ) Fix: https://github.com/pytorch/pytorch/issues/139460 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139461 Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/sraikund16	2024-11-02 01:02:29 +00:00
Angela Yi	092fe2f422	Handle nan case when checking mutations (#139483 ) Test Plan: PT2 readiness models Differential Revision: D65340986 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139483 Approved by: https://github.com/zou3519	2024-11-02 00:49:05 +00:00
William Wen	b71e813bce	[dynamo, 3.13] fix bytecode nop tests (#139323 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139323 Approved by: https://github.com/jansel	2024-11-02 00:39:36 +00:00
Bin Bao	8c17830dea	[AOTI] Unify how weights are stored as data section (#139471 ) Summary: https://github.com/pytorch/pytorch/pull/118076 introduced a cleaner way to link weights as a data section for macos. Unify the code by adopting that approach for Linux as well. Test Plan: CI Differential Revision: D65302273 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139471 Approved by: https://github.com/chenyang78	2024-11-02 00:23:24 +00:00
eellison	ee2f8a50d3	Class rename (#139490 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139490 Approved by: https://github.com/exclamaforte, https://github.com/zou3519 ghstack dependencies: #139295	2024-11-02 00:10:17 +00:00
PyTorch MergeBot	b617d4813c	Revert "fix dynamo tracking numpy 2 ops (#138686 )" This reverts commit `124eac255e`. Reverted https://github.com/pytorch/pytorch/pull/138686 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I am seeing inductor failure with hf_BigBird number of graph breaks after it lands ([comment](https://github.com/pytorch/pytorch/pull/138686#issuecomment-2452718164))	2024-11-01 23:34:06 +00:00
eellison	2382b3b6d8	[Easy] Add joint graph passes, fallback_random to bisector (#139295 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139295 Approved by: https://github.com/zou3519, https://github.com/exclamaforte	2024-11-01 23:21:53 +00:00
Gabriel Ferns	1e73842029	Refactor FxGraphDrawer to use HTML-like labels (#137726 ) Fixes https://github.com/pytorch/pytorch/issues/137499 Testing: Added a new unit test to make sure that the regression case succeeds. I'm debating about whether to make the borders visible. I'm partial to no borders, but it might make it harder for some people to read? ![68a2b0e3-orig_fx_graph_diagram](https://github.com/user-attachments/assets/fbc2fd98-9e76-488e-8ebe-c64fbf206932) Vs. ![2bfe1c4f-orig_fx_graph_diagram](https://github.com/user-attachments/assets/b6bc88ba-dda2-4cf7-84ac-a615e1e03a74) Pull Request resolved: https://github.com/pytorch/pytorch/pull/137726 Approved by: https://github.com/eellison, https://github.com/malfet	2024-11-01 23:19:50 +00:00
David Berard	60542eeb33	[inductor] set sanitize_overflow=False for triton kernels (#139502 ) In upstream triton, https://github.com/triton-lang/triton/pull/4589 introduces overflow checks. However, overflow checks likely add some overhead, and have some correctness bugs at the moment (e.g. https://github.com/triton-lang/triton/pull/5033). Let's set `sanitize_overflow=False` but keep `debug=True` so that we can keep using device_assert but without the additional asserts added by `sanitize_overflow`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139502 Approved by: https://github.com/bertmaher	2024-11-01 23:10:21 +00:00
Mikayla Gawarecki	a979318ef7	Add section to serialization note re weights_only (#139433 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139433 Approved by: https://github.com/malfet ghstack dependencies: #138936, #139221	2024-11-01 21:51:50 +00:00
Edward Z. Yang	a6630bcf87	Profile guided optimization for automatic_dynamic (#139001 ) Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR. This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001 Approved by: https://github.com/oulgen	2024-11-01 21:43:25 +00:00
Xuan Zhang	9c2ffce71a	add condition for freeable input buffer (#139480 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139480 Approved by: https://github.com/yf225 ghstack dependencies: #139396	2024-11-01 21:15:40 +00:00
Sam Larsen	c412a42ae2	[pt2 logging] move remote cache get/put logging up one level (#139423 ) Summary: I need to refactor the way we record CompilationMetrics. It will be much easier to do in OSS and having the relevant timing code in the OSS area of the codebase will make this much easier. I doubt this meaningfully changes the values we see. Test Plan: Made sure samples show up: https://fburl.com/scuba/dynamo_compile/sandbox/c38zjq0x Differential Revision: D65290089 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139423 Approved by: https://github.com/oulgen	2024-11-01 21:06:59 +00:00
Animesh Jain	0e57f2b589	[invoke_subgraph] Change the joint_graph output signature to simplify min-cut partitioner (#139326 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139326 Approved by: https://github.com/zou3519 ghstack dependencies: #139216, #139130	2024-11-01 21:02:32 +00:00
Animesh Jain	6a268c3fbb	[invoke_subgraph] Generate fake_inputs correctly (#139130 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139130 Approved by: https://github.com/zou3519 ghstack dependencies: #139216	2024-11-01 21:02:32 +00:00
Animesh Jain	4c756cacfd	[invoke_subgraph] Re-enable fake tensor model in the fake tensor impl (#139216 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139216 Approved by: https://github.com/zou3519	2024-11-01 21:02:32 +00:00
Justin Chu	5d67efb809	[ONNX] New registration API (#135403 ) The ONNX custom ops registration API. ## Design 1. Create a "custom_translation_table: dict[Callable, Sequence[Callable] \| Callable]" parameter for specifying extra functions 2. Use a callable as the key to support all possible call_function targets in the fx graph 3. Allow a callable or a Sequence of callables as values. - When there is a single callable, it is the translation function for the op - When there is a Sequence of callables, the exporter's dispatcher will dispatch to these callables in order based on input dtypes. - The translation functions can be a plain python function that calls onnxscript ops (traced), or an onnxscript function. - Complex input support: We create special type annotations for annotating real representations of complex inputs, which are needed to handle complex computation in the ONNX graph, as we don't have any ops in ONNX that handle complex inputs. The dispatcher will have knowledge of these newly created type annotations and dispatch correctly. The complex functions will be in the same overload pool as the real functions. ```py torch.onnx.export(dynamo=True, custom_translation_table = { torch.ops.aten.add: [overload1, overload2], torch.sym_not: sym_not_onnx, }) ``` Support for functions that handle complex inputs will be in separate PRs. fixes https://github.com/pytorch/pytorch/issues/138391 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135403 Approved by: https://github.com/titaiwangms	2024-11-01 20:58:54 +00:00
Jason Ansel	73c0762a34	[inductor] Simplify remove_kernel_local_buffers (#139452 ) I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452 Approved by: https://github.com/shunting314 ghstack dependencies: #139364, #139365, #139370	2024-11-01 20:36:39 +00:00
Yifu Wang	0dbc284a72	[SymmetricMemory] expose signal_pads as tensors in Python (#138754 ) ## This Stack This stack does the following things to support `xformers`-style, comm-aware Triton kernels: - Exposes `signal_pad`s as tensors in Python - Adds a binding for `cuMemsetAsync` These in combination aim to provide users with more flexibility to express custom signaling/synchronization patterns. ## This PR ```python # Obtain the signal pad of the specified peer rank as a tensor. # If both shape and dtype are unspecified, the returned tensor will be a # 1d uint32 tensor, which is most natural for signaling purposes. symm_mem.get_signal_pad(peer_rank) # If only shape is specified, it is equivalent to: # symm_mem.get_signal_pad(peer_rank)[:shape.numel()].view(shape) symm_mem.get_signal_pad(peer_rank, shape) # If only dtype is specified, it is equivalent to: # symm_mem.get_signal_pad(peer_rank).view(dtype) symm_mem.get_signal_pad(peer_rank, dtype=dtype) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138754 Approved by: https://github.com/weifengpy, https://github.com/lw	2024-11-01 20:17:15 +00:00
Haifeng Jin	124eac255e	fix dynamo tracking numpy 2 ops (#138686 ) Fixes #136559 As we upgrade to NumPy 2, torch falsely filtered out `numpy.random` as unsupported in dynamo tracking. This PR changes the filtering rules to include them while keeping behavior with numpy 1 unchanged. Before this PR, the following tests failed: ``` PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors ``` With this PR, the supported/unsupported ops in NumPy 1 are not changed. For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list. I used the following scripts to check the differences before and after the change for both NumPy 1 & 2. The output is empty for NumPy 1 since there is no change. The output is a list of `numpy.random` ops that are considered supported for NumPy 2. 
```py from torch._dynamo import trace_rules import numpy as np def new_numpy_function_ids(): unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"} def is_supported(k, v, mod): if not callable(v): return False if not getattr(v, "__module__", None): return True if v.__module__ == mod.__name__: return True if v.__module__ == "numpy.random.mtrand" and mod.__name__== "numpy.random" and k not in unsupported_funcs: return True return False rv = {} for mod in trace_rules.NP_SUPPORTED_MODULES: for k, v in mod.__dict__.items(): if is_supported(k, v, mod): rv[id(v)] = f"{mod.__name__}.{k}" return rv def old_numpy_function_ids(): rv = {} for mod in trace_rules.NP_SUPPORTED_MODULES: rv.update( { id(v): f"{mod.__name__}.{k}" for k, v in mod.__dict__.items() if callable(v) and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__ } ) return rv rv1 = set(old_numpy_function_ids().values()) rv2 = set(new_numpy_function_ids().values()) for v in (rv1 - rv2): print(v) print("****") for v in (rv2 - rv1): print(v) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686 Approved by: https://github.com/lezcano, https://github.com/williamwen42	2024-11-01 19:51:40 +00:00
Mikayla Gawarecki	ea0e09b3f3	Add utility to get all unsafe globals in checkpoint (no pickletools dependency) (#139221 ) Fixes https://github.com/pytorch/pytorch/issues/129698 https://github.com/pytorch/pytorch/pull/139106 without pickletools Pull Request resolved: https://github.com/pytorch/pytorch/pull/139221 Approved by: https://github.com/malfet ghstack dependencies: #138936	2024-11-01 19:31:39 +00:00
rzou	f3b485eb2a	[reland] Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#137064 ) This is to match the default layout constraint for custom operators. By default, Inductor should match the stride order of inputs to a triton kernel. IF THIS IS BREAKING YOU, PLEASE REACH OUT, especially if it's been more than two weeks since this landed. You can flip the config locally as a workaround. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/137064 Approved by: https://github.com/albanD, https://github.com/eellison	2024-11-01 19:21:16 +00:00
Colin L. Rice	abc5d59dcb	config: create Config objects with JK support (#138766 ) This teaches install_config_module (and the underlying code) to understand Config objects. Additionally we've added a JK option to this which resolves the JK. This config gets stored within the _ConfigEntry class and is evaluated when __getattr__ is called. If justknobs is set, it'll call justknobs_check to see the result. Due to preceding work, basically everything works correctly here and we had to update a couple of tests, and modify the getattr behaviour. Note that we are updating the justknob_check function to support a default option, to make default work. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138766 Approved by: https://github.com/ezyang	2024-11-01 19:20:37 +00:00
Sam Larsen	d8b606ecb5	[fx graph cache] Support freezing with FX graph caching (#136505 ) Summary: The main changes to support freezing are: 1) When pickling constant tensors as part of the cache key calculation: If freezing has not been applied, then keep the existing behavior (pickle the metadata and values). If freezing has been applied, then pickle the values if the constant will be inlined; otherwise, consider only the metadata. 2) If freezing has been applied, modify what we store in the cache: Instead of storing the constant attributes in the cache entry, store the _names_ of the constants, and then grab those constants from the GraphModule when we need to attach the attributes to a newly-loaded Python module. Since the cache lookup path loads the Python module, this bullet means we need to thread through a GraphModule argument in several places. 3) Since this feature means that we may need to reload the same Python module path more than once (but attach different constant attributes), I changed PyCodeCache.load_by_key_path to not store an in-memory map of path to module (since there may be more than one). I don't _think_ this will have any effect on performance, however. It's unclear why we were using an in-memory cache here anyway, since this function should only be called once for each module needed to be loaded. 4) Several tests were removing on-disk PyCodeCache artifacts by iterating over the modules. I made this more straightforward by implementing a cache_clear method that removes the on-disk artifacts. Arguably, this should have been the implementation all along. Differential Revision: [D63542170](https://our.internmc.facebook.com/intern/diff/D63542170) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136505 Approved by: https://github.com/eellison	2024-11-01 18:29:29 +00:00
vladimirrotariu	7d644f025f	make equation behind torch.isclose element-wise (#138459 ) The current formula behind torch.isclose, according to the docs, is ![imagen](https://github.com/user-attachments/assets/6b79f6d8-e675-4585-b26b-0c6933f7ecdd) However, torch.isclose acts element-wise, so this formula may be misleading at first, given that the docs said that `input` and `other` are the first, respectively second tensor to compare. I propose the following change, to stress the element-wise nature of the norms in the equation: ![imagen](https://github.com/user-attachments/assets/2926a1c6-c4fa-4c48-8874-106521d3f54c) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138459 Approved by: https://github.com/soulitzer	2024-11-01 18:18:33 +00:00
sanchitintel	3cbf0c0bbf	[Inductor][CPP] Cache weight tiles in L1D for AMX int8 WoQ GEMM (#136688 ) # Summary The AMX ISA based GEMM micro-kernel template for int8 weight-only quantization (BF16 activation, int8 weights) should cache dequantized weights (int8 -> int32 -> fp32 -> bf16) so that they would not have to be dequantized again in subsequent calls to the _inner-kernel_ that uses the same weights. This change leverages the fact that even for BF16 x BF16 GEMM template, cache-blocking ensures that `Nr * Kc` weight elements are cached in L1D cache (more info [here](https://static.sched.com/hosted_files/pytorch2024/59/TorchInductor%20CPU%20Backend%20Advancements%20-%20New%20Features%20and%20Performance%20Improvements_20240915.pdf)). Here, `Nr` is the register blocking size for `N` dimension (at the granularity of the GEMM micro-kernel, it's currently also the cache blocking size for `N` dimension, although that may change in the future), and `Kc` is the cache blocking size for `K` dimension. The figure below is from the document linked above - <img width="476" alt="image" src="https://github.com/user-attachments/assets/e23e5476-d910-46d1-a9b3-cbf77de76d94"> ## Performance data Collected on 48 physical cores of one socket of Intel Xeon Platinum 8468H (Xeon SP 4th gen). Intel OpenMP & tcmalloc were preloaded. \|M \| N \| K \| Latency with ATen _weight_int8pack_mm \| Latency with codegened templated GEMM (current main branch) \| Latency with codegened templated GEMM (this PR) \| \|-----\|-----\|-----\|------\|----------\|----\| \|4096\|4096\|4096\| 45.844 ms \| 9.322 ms\| 5.2181 ms \| \|4096\|11008\|4096\| 127.618 ms \|24.6258 ms \| 13.6046 ms\| \|4096\|4096\|11008\| 121.953 ms \| 25.4692 ms \| 10.2669 ms \| \|4096\|32000\|4096\| 478.450 ms\| 75.3942 ms \| 48.21 ms \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/136688 Approved by: https://github.com/jgong5	2024-11-01 16:32:22 +00:00
Jason Ansel	b57b4b7f9b	[inductor] Move remove_kernel_local_buffers to Kernel (#139370 ) This method mutates the kernel, so it fits better in that class. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139370 Approved by: https://github.com/shunting314 ghstack dependencies: #139364, #139365	2024-11-01 16:28:15 +00:00
Jason Ansel	1e934b473c	[inductor] Remove Node.last_usage mutation (#139365 ) I can't figure out why this is needed. Let's see if tests fail. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139365 Approved by: https://github.com/shunting314 ghstack dependencies: #139364	2024-11-01 16:28:15 +00:00
Jason Ansel	286d3ce266	[inductor] Remove SIMDKernel.last_usage (#139364 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139364 Approved by: https://github.com/eellison, https://github.com/shunting314	2024-11-01 16:28:15 +00:00
Shuqiang Zhang	df0c1eceb9	[pgnccl][simple] clean up unused members of PGNCCL (#139436 ) Summary: Found those unused members when prototyping something else. Better remove unused members Test Plan: CI Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/139436 Approved by: https://github.com/Skylion007	2024-11-01 16:25:04 +00:00
Bin Bao	33dce10ece	[AOTI][reland] Update zero size computation in clone_preserve_strides (#139458 ) Summary: Reland https://github.com/pytorch/pytorch/pull/139224. clone_preserve_strides implemented in _inductor/utils.py does not handle multi-dimensional 0-size tensor correctly. Differential Revision: [D65317451](https://our.internmc.facebook.com/intern/diff/D65317451) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139458 Approved by: https://github.com/hl475	2024-11-01 13:51:02 +00:00
Yifu Wang	e6e140c3d7	[Inductor] fix a compilation time regression caused by user-visible output handling (#139420 ) This PR fixes a compilation time regression manifested in timm_models/hrnet_w18 caused by https://github.com/pytorch/pytorch/pull/136732. The regression is reproducible locally. The compilation time is a bit noisy, but it's still possible to tell the difference. ``` Before the offending PR compilation_latency mean=176.022 seconds compilation_latency mean=176.564 seconds On the offending PR compilation_latency mean=180.096 seconds compilation_latency mean=179.101 seconds On the fix compilation_latency mean=173.153 seconds compilation_latency mean=174.182 seconds ``` (I think the fix being faster than the baseline is due to noise) The cause of the regression is an inefficiency in `is_user_visible_output()`. 
Specifically, it used `output_node.args[0].index(node)` to obtain the output idx for each node (and we called this for each node twice). The offending PR had the assumption that `len(output_node.args[0])` is rather small. However, it has been proven false by the benchmark (it was 1900+ for timm_models/hrnet_w18). The fix is to precompute `user_visible_output_strides` once by iterating only over the nodes in `output_node.args[0]`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139420 Approved by: https://github.com/ezyang	2024-11-01 08:27:40 +00:00
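The cost difference described in the commit above can be illustrated with a minimal sketch. The names `outputs`, `strides`, `stride_slow`, and `stride_fast` are hypothetical and not taken from Inductor; the sketch only contrasts a per-node `list.index` lookup with a mapping that is precomputed once.

```python
# Hypothetical stand-ins: `outputs` plays the role of output_node.args[0],
# and `strides` holds one stride entry per output position.
outputs = [f"node_{i}" for i in range(2000)]
strides = [(i, 1) for i in range(2000)]


def stride_slow(node):
    # list.index scans the list, so calling this for every node
    # costs O(N^2) in total.
    if node not in outputs:
        return None
    return strides[outputs.index(node)]


# Precompute once by iterating only over the output nodes: O(N) in total.
# setdefault keeps the first position, which matches list.index when a
# node appears more than once.
stride_by_node = {}
for idx, node in enumerate(outputs):
    stride_by_node.setdefault(node, strides[idx])


def stride_fast(node):
    # Dictionary lookup is O(1) on average.
    return stride_by_node.get(node)
```

With roughly 1900 outputs, as reported for timm_models/hrnet_w18, the precomputed mapping replaces thousands of linear scans with constant-time lookups.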
Shunting Zhang	5e4c8b671c	[inductor] loaf-fix (#139376 ) Fix https://github.com/pytorch/pytorch/issues/128063 . Now for this snippet ``` def f(x): y = torch.sum(torch.sum(x, dim=-1)) z = x / 10.0 z_t = z.t().contiguous().t() return y, z, z_t ``` Inductor could generate a single kernel for the first reduction and the two pointwise kernels (if loop-ordering after fusion is enabled). And the generated kernel reads `x` only ONCE. (Without proper handling, the two pointwise ops may each access x once even if they are fused.) The PR needs to fix 2 subtle bugs regarding LOAF. 1. When we reorder loops for a FusedSchedulerNode, we check if each sub-node's sizes match. But some nodes have their sizes in `list` type (if their loops are not reordered) while others have their sizes in `tuple` type (if their loops are reordered). I could change the upstream code to uniformly use either `list` or `tuple`. But without strong enforcement, future code could break this. So I just convert sizes to a uniform type before comparison. 2. We have a cache for tiling decisions of a BaseSchedulerNode. If we reorder loops for the node, we should invalidate the cache. Otherwise, a stale tiling decision can result in a (very) bad kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139376 Approved by: https://github.com/jansel, https://github.com/eellison	2024-11-01 07:54:32 +00:00
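The first bug in the commit above comes down to a plain Python rule: a `list` never compares equal to a `tuple`, even when the elements are identical. The following is a minimal sketch of the normalization idea, assuming flat size sequences; `sizes_match` is a hypothetical helper, not the function used in Inductor.

```python
def sizes_match(sizes_a, sizes_b):
    # Normalize both operands to tuples so the container type cannot
    # affect the result of the comparison.
    return tuple(sizes_a) == tuple(sizes_b)


not_reordered = [64, 128]  # sizes kept as a list
reordered = (64, 128)      # sizes stored as a tuple

# A direct comparison is False purely because the container types differ.
naive_equal = not_reordered == reordered

# After normalization only the contents are compared.
normalized_equal = sizes_match(not_reordered, reordered)
```

Converting at the comparison site keeps the check robust even if later code starts producing a different sequence type.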
lingzhi98	39ec5a20ea	[Partitioner] Enumerate partitions by iterating partition ids (#136598 ) Currently, we get all partition ids by iterating over assignment, whose size is the same as the number of nodes in the graph. But we can reach the same result by iterating over partitions_by_id, whose size is much smaller than the number of nodes. Assuming the number of nodes is N and the number of partitions is P, the time complexity decreases from O(N * N) to O(N * P) after this patch. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136598 Approved by: https://github.com/tarun292 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-11-01 07:42:36 +00:00
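A minimal sketch of the enumeration change described above, under the assumption that `assignment` maps each node to a partition id and `partitions_by_id` maps each partition id to its nodes. The data and function names are illustrative only and are not the partitioner's actual code.

```python
# Hypothetical data: 6 nodes assigned to 2 partitions.
assignment = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

# Reverse mapping from partition id to the nodes it contains.
partitions_by_id = {}
for node, pid in assignment.items():
    partitions_by_id.setdefault(pid, set()).add(node)


def partition_ids_via_assignment():
    # Visits one entry per node: N entries.
    return sorted(set(assignment.values()))


def partition_ids_via_partitions():
    # Visits one key per partition: P entries.
    return sorted(partitions_by_id)
```

Both functions return the same ids; when this enumeration sits inside a loop that already runs once per node, the second form yields the O(N * P) bound from the commit message.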
andras_matyassy	61df90e3f6	Add TORCHDYNAMO_EXTENDED_ADVICE (#137159 ) (#137196 ) Fixes #137159 Happy to contribute to this project for the first time. If I missed any contribution guidelines, please let me know, I'm happy to adjust. Pull Request resolved: https://github.com/pytorch/pytorch/pull/137196 Approved by: https://github.com/ezyang	2024-11-01 06:43:26 +00:00
angelayi	86db2cd194	[export] Initial draft export (#139383 ) Differential Revision: [D65288590](https://our.internmc.facebook.com/intern/diff/D65288590) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139383 Approved by: https://github.com/zou3519	2024-11-01 06:25:44 +00:00
FFFrog	300ca6368f	Remove deprecated alias macro(2/3) (#137559 ) Detailed Descriptions: - Remove AT_ASSERTM Macro Pull Request resolved: https://github.com/pytorch/pytorch/pull/137559 Approved by: https://github.com/ezyang	2024-11-01 06:17:57 +00:00
William Wen	0c47657b05	[dynamo] ignore False/None callback in fail_on_recompile/force_backend stances (#139215 ) Fix https://github.com/pytorch/pytorch/issues/139202 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139215 Approved by: https://github.com/jansel	2024-11-01 06:15:28 +00:00
cyy	4a2da52137	[1/N] Replace c10::sv with std::sv (#139453 ) Picks some safe replacements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139453 Approved by: https://github.com/Skylion007	2024-11-01 05:39:37 +00:00
Will Constable	84416618a6	[Pipelining] Update schedules to use I, B actions. (#138886 ) Also, update tests to use I (BACKWARD_INPUT) vs B (FULL_BACKWARD) consistently. Previously, schedules would issue a 'B' operation and leave it ambiguous whether that operation should be BACKWARD_INPUT or FULL_BACKWARD, depending on a separate flag (use_full_backward) passed to the schedule class, which would determine which behavior was taken at runtime. Now, use_full_backward is removed and the schedule class is required to produce unambiguous IR. The logic for 'use_full_backward' is removed from the runtime. _validate_pipeline_order is replaced with _simulate_comms_compute. Both offer similar functionality, to validate the correctness of a schedule IR. 'validate' operates on compute-only IR, while simulate operates on compute + comm IR. To convert from using validate to simulate, you have to first insert comm actions via '_add_send_recv'. 'simulate' was inefficiently written before this PR and needed to be optimized to run quickly for extra large schedules with >32 ranks and microbatches per rank used in some unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138886 Approved by: https://github.com/H-Huang	2024-11-01 03:54:06 +00:00
Bob Ren	094d288f40	Update tensorify pass to specialize symfloats we didn't tensorify away (#138868 ) As discussed w/ @ezyang offline, one way to de-risk the `specialize_float=False` rollout is to specialize all backed symfloats that we fail to tensorify away. This diff does a few things: 1) It fixes a bug where item_memo gets dropped (due to incorrect epoch invalidation) 2) It updates the tensorify pass to do the backup specialization This pass was originally part of the [PR](https://github.com/pytorch/pytorch/pull/137782) that flips `specialize_float=False` but we learned that the blast radius is simply too large. We've pivoted to a more milestone driven approach where we learn from the failures of the aforementioned PR and cherry pick fixes into main first. After this current PR lands our strategy is as follows: 1) Integrate turning off specialize float only in the automatic dynamic pass. 2) Put up a canary diff that only turns off specialize float in `backend=eager` mode to sniff out symfloat related bugs in dynamo due to code paths we previously never exercised. 3) Put up a canary diff that only turns off specialize float in `backend=aot_eager` mode to sniff out symfloat related bugs in aotautograd due to code paths we previously never exercised. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138868 Approved by: https://github.com/ezyang	2024-11-01 03:18:02 +00:00
James Wu	c8a648d4df	Add option to dynamo_timed and chromium_event_logger for logging pt2 compile events (#139309 ) This diff considerably changes the column format of PT2 Compile Events: - Now, instead of logging one new column per every piece of metadata, we just log a single column, "metadata". This vastly decreases the number of columns we need to log, which should help with retention. - Now, we only log to scuba for a set of dynamo_timed() events that we actually care about aggregating. To do so, we add a boolean to dynamo_timed() that decides whether or not to log a pt2_compile_event. We'll always log a chromium_event for every dynamo_timed(), but only log a subset of those to scuba. Differential Revision: [D65225598](https://our.internmc.facebook.com/intern/diff/D65225598/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139309 Approved by: https://github.com/oulgen	2024-11-01 02:40:25 +00:00
Gabriel Ferns	030f70b40b	Allow inplacing buffer when other users are inconsequential (#138383 ) Summary: I think we can inplace a buffer if all of the users of said buffer are "inconsequential", defined as having been removed, being completed, or being part of the ancestors set. In particular, this allows LayerNorm to inplace its input buffer. Implements: https://github.com/pytorch/pytorch/issues/132826 Test Plan: New unit test of matmul followed by LayerNorm, make sure there's an inplaced buffer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138383 Approved by: https://github.com/eellison	2024-11-01 01:24:40 +00:00
cyyever	8ace3e8023	Add sv starts/ends_with (#139261 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139261 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-11-01 01:17:42 +00:00
Mikayla Gawarecki	2a309c0997	Fix weights_only for BUILD instructions for user allowlisted objects with __slots__ (#138936 ) Previously `BUILD` instruction missed handling for `__slots__`. This only applies for things allowlisted via `add_safe_globals`/`safe_globals` that use slots. ### Background When does pickle serialize a `BUILD` instruction? When `state` is not `None` and `state_setter` is `None` [[link](`c5b99f5c2c/Lib/pickle.py (L765)`)]. In this case, the docs tell us that either `__setstate__` or a `__dict__` update will be performed [[link](https://github.com/python/cpython/blob/3.13/Lib/pickletools.py#L1984)] `__reduce__`/`__reduce_ex__` are expected to return tuples of length 2 to 6 where `state` is the 3rd argument. When user doesn't patch `__reduce__` but patches `__setstate__`/`__getstate__`, state will be what is yielded by `__getstate__` Note the return type for [`__getstate__` ](https://docs.python.org/3/library/pickle.html#object.__getstate__) - For a class that has no instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and no [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is None. - For a class that has an instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and no [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is `self.__dict__`. - For a class that has an instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__) and [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__), the default state is a tuple consisting of two dictionaries: `self.__dict__`, and a dictionary mapping slot names to slot values. Only slots that have a value are included in the latter. 
- For a class that has [`__slots__`](https://docs.python.org/3/reference/datamodel.html#object.__slots__) and no instance [`__dict__`](https://docs.python.org/3/reference/datamodel.html#object.__dict__), the default state is a tuple whose first item is None and whose second item is a dictionary mapping slot names to slot values described in the previous bullet. See the handling in the pickle code `c5b99f5c2c/Lib/pickle.py (L1846-L1867)`. Before this PR, we didn't account for the fact that when `__setstate__` is not defined, `state` might be a tuple, so this would fail ```python import torch from dataclasses import dataclass # Define the dataclass @dataclass class MyDataClass: __slots__ = ["x", "y"] x: int y: str # Create an instance of the dataclass my_data = MyDataClass(x=2, y=3) # Save the dataclass to a file torch.save(my_data, "my_data.pt") with torch.serialization.safe_globals([MyDataClass]): loaded_my_data = torch.load("my_data.pt", weights_only=True) # AttributeError: 'MyDataClass' object has no attribute '__dict__' ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138936 Approved by: https://github.com/malfet	2024-11-01 00:59:29 +00:00
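The tuple-shaped default state for slotted classes can be observed with the standard library alone. The sketch below is not the `weights_only` unpickler from the PR; `Point` and `apply_state` are hypothetical names, and `apply_state` only follows the fallback that the standard `pickle` module applies for `BUILD` when `__setstate__` is absent.

```python
class Point:
    # No instance __dict__: attributes live in slots only.
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


def apply_state(obj, state):
    # A 2-tuple carries (dict state, slot state); anything else is
    # treated as dict state only.
    slot_state = None
    if isinstance(state, tuple) and len(state) == 2:
        state, slot_state = state
    if state:
        obj.__dict__.update(state)
    if slot_state:
        for name, value in slot_state.items():
            setattr(obj, name, value)
    return obj


# The third item of the reduce value is the state that BUILD consumes.
state = Point(1, "a").__reduce_ex__(2)[2]

# Rebuild without calling __init__, as unpickling does.
restored = apply_state(Point.__new__(Point), state)
```

For a slots-only class the dict half of the tuple is `None`, so a loader that unconditionally updates `__dict__` fails in the way the commit message shows.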
Jason Ansel	f9ef880c0b	[inductor] Refactor kernel args into SIMDKernelFeatures (#139327 ) This is a refactor PR to move stuff around. I'm planning to use the SIMDKernelFeatures class (in a future PR) to host new heuristics for selecting kernel types and block sizes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139327 Approved by: https://github.com/eellison, https://github.com/shunting314	2024-11-01 00:30:14 +00:00
PyTorch MergeBot	b6b9596607	Revert "[dynamo] Fix constant propagation in builtins and UserClasses (#131354 )" This reverts commit `44257c063e`. Reverted https://github.com/pytorch/pytorch/pull/131354 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it seems to break some internal tests ([comment](https://github.com/pytorch/pytorch/pull/131354#issuecomment-2451050605))	2024-11-01 00:13:20 +00:00
IvanKobzarev	d33849908d	[aotd] Fuse tangents subclasses runtime traversals (#139068 ) Reason: Currently we have multiple traversals for tangents in runtime: - To check that types and structure are identical to what we guessed during tracing time - Coerce metadata - Coerce memory_format - Unwrap_tensor_subclass All of them traverse the tree of Subclasses for the tangents via __tensor_flatten__ calls. Change: To do everything in one traversal at runtime (including flattening) Implementation details: Add memory_format information inside SubclassCreationMeta, for PlainTensors keep not only (int) of unwrapped_index, but memory_format too. Preparing memory_format is optional (controlled by with_memory_format=True). 2. Removing unused subclass_utils.create_metadata_for_subclass which does not have any usages inside torch and would require update of the logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139068 Approved by: https://github.com/bdhirsh	2024-11-01 00:03:02 +00:00
Xuan Zhang	86602a66d7	[orm] fix live_memory computation in lpmf algorithm (#139396 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139396 Approved by: https://github.com/yf225	2024-10-31 23:45:30 +00:00
PyTorch MergeBot	3d3551506d	Revert "[dynamo, 3.13] fix bytecode nop tests (#139323 )" This reverts commit `c2d754441f`. Reverted https://github.com/pytorch/pytorch/pull/139323 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a regression in instruction count metric ([comment](https://github.com/pytorch/pytorch/pull/139323#issuecomment-2451017609))	2024-10-31 23:34:00 +00:00
Will Constable	8e8040a5c2	[Pipelining] Optimize ready_to_schedule logic (#138924 ) Used in both simulator and add_send_recv pass, the ready_to_schedule logic works by looking at all the previously scheduled ops on a rank to see if any of them 'unblocks' the current op to be scheduled. For example, to schedule a FORWARD op, a previous RECV_F op is needed, unless this is stage 0 or there is a previous stage on the same rank that ran FORWARD already. The old implementation iteratively compared the candidate op to the previous ops. The new implementation uses set lookups to reduce complexity. It also maintains the set of previous ops as ops are scheduled rather than constructing a set on demand. I did not save benchmark results, but this results in a 10-100x speedup which is most noticeable for unit tests with artificially huge schedule IR, the largest of which took longer than 20m before (I never let it finish) but now takes less than 14s. Most schedules take less than 10ms. Pull Request resolved: https://github.com/pytorch/pytorch/pull/138924 Approved by: https://github.com/H-Huang ghstack dependencies: #138928, #131762	2024-10-31 22:49:45 +00:00
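The set-lookup idea from the commit above can be sketched as follows. This is a simplified illustration covering only the FORWARD rule quoted in the message; the `Action` record, the function names, and the treatment of every other op kind as always ready are assumptions, not the real pipelining code.

```python
from collections import namedtuple

# Hypothetical action record; not the real pipelining IR type.
Action = namedtuple("Action", ["stage", "kind", "microbatch"])


def ready_to_schedule_forward(action, prev_ops):
    # prev_ops is the set of Actions already scheduled on this rank.
    if action.stage == 0:
        return True
    if Action(action.stage, "RECV_F", action.microbatch) in prev_ops:
        return True
    # The previous stage lives on the same rank and already ran FORWARD.
    return Action(action.stage - 1, "FORWARD", action.microbatch) in prev_ops


def schedule(candidates):
    prev_ops = set()  # maintained incrementally, not rebuilt per query
    scheduled = []
    for action in candidates:
        # Assumption for brevity: only FORWARD ops have a readiness rule.
        if action.kind != "FORWARD" or ready_to_schedule_forward(action, prev_ops):
            scheduled.append(action)
            prev_ops.add(action)
    return scheduled
```

Each readiness check is a constant number of hash lookups, instead of a scan over all previously scheduled ops.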
Will Constable	c82e0d117a	[Pipelining] Support separate dI / dW and V-schedules (#131762 ) ### Separate dI / dW: PipelineScheduleRuntime now supports execution of merged FULL_BACKWARD or separate dI / dW operations. Separating the B and W may add execution overhead or may be suboptimal in cases where BW are 'fused', but it is worthwhile when separating B, W lets the schedule be more efficient by filling in bubbles. In some cases, the schedule will still issue B followed by W at certain points, so in these cases just merge them back into BW ops and execute them as full backwards rather than executing a B followed by a W. ### V-schedules: V-schedules have a special case where the last rank has 2 adjacent stages. E.g. if rank3 had stage 3 and stage 4, then we should implement direct transfer of stage3 outputs to stage4 inputs without a send/recv. In the scheduling logic, we also must allow scheduling the stage 4 forward after running stage 3 forward, without expecting a stage 4 RECV_F. In the runtime, we pass activations between adjacent stages without using SEND/RECV ops since the stages are on the same rank/process. We add new APIs to PipelineStage abstraction for passing the activations both during forward and backward. Currently the implementation directly modifies the 'recv buffers' the stage is managing, so the forward/backward execution logic does not need to know the difference. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131762 Approved by: https://github.com/H-Huang ghstack dependencies: #138928	2024-10-31 22:49:45 +00:00
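The "merge B followed by W back into BW" step mentioned above can be sketched as a single pass over a rank's action list. The tuple layout `(stage, kind, microbatch)` and the function name `merge_bw` are assumptions made for illustration; the real schedule IR is richer than this.

```python
def merge_bw(actions):
    # actions: list of (stage, kind, microbatch) tuples, where kind is
    # one of "F", "B", "W", or "BW" (hypothetical encoding).
    merged = []
    i = 0
    while i < len(actions):
        cur = actions[i]
        nxt = actions[i + 1] if i + 1 < len(actions) else None
        adjacent_pair = (
            nxt is not None
            and cur[1] == "B"
            and nxt[1] == "W"
            and cur[0] == nxt[0]  # same stage
            and cur[2] == nxt[2]  # same microbatch
        )
        if adjacent_pair:
            # Execute as one full backward instead of B then W.
            merged.append((cur[0], "BW", cur[2]))
            i += 2
        else:
            merged.append(cur)
            i += 1
    return merged
```

Only directly adjacent B/W pairs for the same stage and microbatch are fused; a W that was deliberately delayed to fill a bubble stays separate.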
Zhengxu Chen	45da80b970	reland D65167805 "[export] Update min_val and max_val to Optional[int] in serialization." (#139394 ) Summary: had a land racing with another diff D65166035 to fix the schema. According to export team's discussion, we are upgrading min_val and max_val to optional fields which shouldn't break BC and allows the schema to express infinity. Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/rec/ir/tests:ir_export_deserialize_test Differential Revision: D65273170 Pull Request resolved: https://github.com/pytorch/pytorch/pull/139394 Approved by: https://github.com/yiming0416	2024-10-31 22:28:32 +00:00
Donald Tolley	c1e7d85ce6	Add Weighted Loss Functions to PyTorch : WMSE, WMAE, and Weighted Huber Loss (#132049 ) #### Summary This pull request introduces new weighted loss functions to the PyTorch library: `weighted_huber_loss`, `wmse_loss`, and `wmae_loss`. These functions allow for precise control over the influence of each sample during training, important for imbalanced data or when certain samples are more significant than others. #### Changes - `weighted_huber_loss`: Huber loss modified to incorporate weights, providing a balance between L1 and L2 loss based on the `delta` parameter. - `wmse_loss` (Weighted Mean Squared Error): Applies weights to the standard MSE loss, useful for emphasizing certain samples in regression tasks. - `wmae_loss` (Weighted Mean Absolute Error): Adjusts MAE loss calculation by including weights, ideal for datasets with outliers. #### Code Details - Input Validation: Ensures `input`, `target`, and `weights` tensors match in size to prevent broadcasting errors. - Reduction Options: Supports `none`, `mean`, and `sum` reductions to suit various computational needs. - Backward Compatibility: Maintains support for deprecated arguments `size_average` and `reduce`, while encouraging use of the `reduction` argument. #### Usage Example ```python import torch input = torch.tensor([0.5, 2.5, 2.0], dtype=torch.float32) target = torch.tensor([0.0, 2.0, 1.5], dtype=torch.float32) weights = torch.tensor([1.0, 0.5, 1.5], dtype=torch.float32) loss = weighted_huber_loss(input, target, weights, delta=1.0) print(loss) ``` --- Feedback on these implementations is welcome; please let me know if further modifications are required. Resolves #132465 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132049 Approved by: https://github.com/mikaylagawarecki Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>	2024-10-31 21:59:43 +00:00
Simon Fan	82e74ad40e	[aot autograd] refactor CompiledFunction.backward: control flow (3/N) (#139347 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139347 Approved by: https://github.com/zou3519 ghstack dependencies: #139331, #139343	2024-10-31 21:53:03 +00:00
Simon Fan	8134456a27	[aot autograd] refactor CompiledFunction.backward: epilogue (2/N) (#139343 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139343 Approved by: https://github.com/zou3519 ghstack dependencies: #139331	2024-10-31 21:53:03 +00:00
Simon Fan	04ce9ec087	[aot autograd] refactor CompiledFunction.backward: prologue (1/N) (#139331 ) So for functional autograd + CA, most nodes are inlined in aot autograd. But user-defined callables aren't safe to make_fx unless dynamo traces through them. The AOT backward must be inlined by dynamo time. We plan to directly insert calls to the backward in the graph: - call prologue - call bwd graph - call epilogue Restructuring our AOT bwd implementation will make this implementation easier. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139331 Approved by: https://github.com/zou3519	2024-10-31 21:53:03 +00:00
angelayi	8c22e09e39	[aoti] Add masked_select to cshim (#139071 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/139071 Approved by: https://github.com/desertfire	2024-10-31 21:52:53 +00:00
PyTorch MergeBot	b9acbde4fd	Revert "Update tensorify pass to specialize symfloats we didn't tensorify away (#138868 )" This reverts commit `a494572799`. Reverted https://github.com/pytorch/pytorch/pull/138868 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the new tests are failing on fbcode ([comment](https://github.com/pytorch/pytorch/pull/138868#issuecomment-2450863895))	2024-10-31 21:46:06 +00:00
Laith Sakka	6a1c451479	Don't uselessly recompute axiom dict every static eval call (#138967 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/138967 Approved by: https://github.com/ezyang	2024-10-31 21:16:55 +00:00
PyTorch MergeBot	c4d9428b17	Revert "[AOTI] Update zero size computation in clone_preserve_strides (#139224 )" This reverts commit `206a8dde68`. Reverted https://github.com/pytorch/pytorch/pull/139224 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/139224#issuecomment-2450811914))	2024-10-31 21:05:07 +00:00
Joel Schlosser	ddb291a881	Fix and test several NJT reductions (#139317 ) I'm sick of reductions not working properly - spotty dim coverage, missing backwards, etc. This PR fixes quite a bit. It applies to the following ops: * `sum` / `mean` / `prod` * `all` / `any` * `amin` / `amax` * `min` / `max` * `argmin` / `argmax` The general reduction logic has been factored out into a helper `_apply_reduction(func, func_name, identity_element, args, kwargs)`. The idea is that by providing a valid identity element, we can utilize conversions to padded dense when needed for reducing over the ragged dim. Extensive test coverage includes: reductions across ragged dim * reductions across non-batch, non-ragged dims * reductions across both batch and ragged dims * multiple dim reductions (for ops that support this) * full reduction -> scalar Bonus: the PR includes backwards fixes for `sum` and `mean`, which have never worked. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139317 Approved by: https://github.com/cpuhrsch	2024-10-31 20:55:38 +00:00
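The identity-element idea from the commit above can be shown with plain Python lists standing in for a nested (jagged) tensor. This is only a sketch of the padding argument; `reduce_over_ragged_dim` is a hypothetical helper and does not reflect the signature or behavior of `_apply_reduction`.

```python
import math


def reduce_over_ragged_dim(rows, func, identity):
    # Pad every row to the longest length with the identity element...
    width = max((len(row) for row in rows), default=0)
    padded = [list(row) + [identity] * (width - len(row)) for row in rows]
    # ...then reduce each fixed-width row; the padding cannot change
    # the result because it is the identity of the reduction.
    return [func(row) for row in padded]


ragged = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]

sums = reduce_over_ragged_dim(ragged, sum, 0.0)
maxes = reduce_over_ragged_dim(ragged, max, -math.inf)
prods = reduce_over_ragged_dim(ragged, math.prod, 1.0)
```

Choosing the wrong identity (for example 0 for `prod`) would silently corrupt the shorter rows, which is why the identity is passed in per op.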


43731 Commits