pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Zihua Wu	d62bdb087d	[Profiler] add missing field device_resource_id (#121480 ) Fixes #121479 Co-authored-by: Aaron Shi <enye.shi@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121480 Approved by: https://github.com/aaronenyeshi	2024-03-12 21:42:53 +00:00
PyTorch MergeBot	5b506c8bce	Revert "[dynamo][guards] Use lazy variable tracker for func defaults (#121388 )" This reverts commit `04a5d6e8d3`. Reverted https://github.com/pytorch/pytorch/pull/121388 on behalf of https://github.com/osalpekar due to causing executorch model-test failures internally. See [D54707529](https://www.internalfb.com/diff/D54707529) ([comment](https://github.com/pytorch/pytorch/pull/121388#issuecomment-1992619251))	2024-03-12 21:31:18 +00:00
Shunting Zhang	522d972924	[eazy] add more log when accuracy check fail (#121656 ) Add these log to debug the regress of accuracy test for dm_nfnet_f0 model for training. With these extra log when the accuracy check fail, we can verify if it's close to succeed or not. If yes that indicates there is no real issue but just flaky and we probably can tune the tolerance to fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121656 Approved by: https://github.com/jansel, https://github.com/Skylion007	2024-03-12 20:58:20 +00:00
Manuel Candales	6d8a7d6e58	[pytorch] optional zero points on dequantize per channel (#121724 ) Summary: X-link: https://github.com/pytorch/executorch/pull/2364 bypass-github-export-checks Test Plan: sandcastle Reviewed By: mikekgfb Differential Revision: D54709217 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121724 Approved by: https://github.com/mikekgfb	2024-03-12 19:54:11 +00:00
Colin Peppler	a6149eba12	[easy] Refactor MultiOutput. codegen_list_tuple_access to use subclass type checks (#121662 ) Summary: # Why? Right now I'm running into a case where `itype` is `torch.fx.immutable_collections.immutable_list` which is a subclass of `list`. However, currently we're checking the concrete types (i.e. `list`) and `immutable_list` isn't explictly supported here. Thus, we use a runtime check that looks at the subclass so we can support subclasses -- such as immutable_list -- as well. Test Plan: ci Differential Revision: D54764829 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121662 Approved by: https://github.com/aakhundov	2024-03-12 19:27:56 +00:00
Tugsbayasgalan Manlaibaatar	90e886aa6c	Sanity check for non-strict (#121687 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121687 Approved by: https://github.com/avikchaudhuri ghstack dependencies: #121652, #121678	2024-03-12 18:21:32 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	443e241cc5	Don't cache predispatch kernels (#121712 ) Summary: Title Test Plan: CI Differential Revision: D54791087 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121712 Approved by: https://github.com/ydwu4	2024-03-12 18:05:59 +00:00
Wanchao Liang	a26480a4d1	[dtensor] move early return check into redistribute autograd function (#121653 ) This PR fixed the bug of redistribute to move early return check into the redistribute autograd function, so that even though we redistribute the same placement, the grad_placements from the `to_local` call might be different, the redistribute backward still need to happen Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653 Approved by: https://github.com/awgu	2024-03-12 17:37:30 +00:00
Animesh Jain	22489bfe70	[dynamo][guards-cpp-refactor] Directly call root guard manager in eval_frame (#121622 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121622 Approved by: https://github.com/jansel ghstack dependencies: #121614	2024-03-12 17:09:11 +00:00
Animesh Jain	2348e8e4e7	[dynamo][guards-cpp-refactor] Simplify DYNAMIC_INDICES guard (#121614 ) Use NO_HASATTR guard for the common part. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121614 Approved by: https://github.com/jansel	2024-03-12 17:08:56 +00:00
PyTorch MergeBot	0398dc9e8e	Revert "[DCP] Makes fsspec public (#121508 )" This reverts commit `d482614fec`. Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))	2024-03-12 17:02:43 +00:00
angelayi	d1715c3adb	[export] Update error message for set_grad (#121666 ) Context: https://fb.workplace.com/groups/222849770514616/posts/381979051268353/?comment_id=383334957799429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121666 Approved by: https://github.com/ydwu4	2024-03-12 16:41:45 +00:00
Jason Ansel	3c8c7e2a46	[dynamo] Tweak naming for module hook bw_state (#121609 ) Some minor changes not related to the other PRs in the stack Pull Request resolved: https://github.com/pytorch/pytorch/pull/121609 Approved by: https://github.com/yanboliang	2024-03-12 16:27:56 +00:00
Chien-Chin Huang	7a68e0a3e8	[DCP][state_dict] Remove the check of FSDP has root (#121544 ) Root may not exist due to FSDP lazy initialization. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121544 Approved by: https://github.com/Skylion007 ghstack dependencies: #121273, #121276, #121290	2024-03-12 15:43:19 +00:00
Andrew Gu	85dc254364	[DTensor] Moved `Transformer` sharding to staticmethod (#121660 ) To support FSDP + TP/SP unit tests, let us factor out the canonical TP/SP sharding of `Transformer` to a staticmethod that can be called by other unit tests. Test Plan: ``` pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121660 Approved by: https://github.com/wanchaol, https://github.com/yifuwang ghstack dependencies: #121360, #121357	2024-03-12 15:08:57 +00:00
Howard Huang	2a99e6f299	Update error message (#121644 ) Summary: We don't want people to move to NCCL exp without explicit opt in. It seems that sparse allreduce was accidentally called and people were confused whether they should use NCCL exp instead. Update the error message to explicitly say that sparse_allreduce is not supported. Test Plan: sandcastle Differential Revision: D54759307 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644 Approved by: https://github.com/awgu	2024-03-12 13:04:21 +00:00
kausik	edf22f3a48	Modify signature of dequantize ops for decomposed quantized Tensor (#119173 ) (#121450 ) Summary: X-link: https://github.com/pytorch/executorch/pull/2308 Note: The initial purpose of this PR is to draw suggestion and feedback regarding better alternative, if any. At present, dequantize op for decomposed quantized Tensor representation e.g. dequantize_per_tensor() assumes the output dtype as torch.float and hence, it does not have the output dtype in its operator argument list. However, this op signature becomes unusable when the assumption breaks. Because, in case the output dtype is different from torch.float, there is no way to specify the same during dequantization. This change is aimed at generalizing the signature of dequantize op like dequantize_per_tensor() for wider use-cases where the output dtype can be different from torch.float and needs to passed during dequantization. The proposal is to use an additional argument named 'output_dtype' to solve the problem. However, we would also like to have suggestion and feedback regarding any better alternative that can be used instead. cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo jgong5 Xia-Weiwen leslie-fang-intel Reviewed By: digantdesai Differential Revision: D53590486 Pulled By: manuelcandales Co-authored-by: kausik <kmaiti@habana.ai> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121450 Approved by: https://github.com/jerryzh168	2024-03-12 12:36:31 +00:00
Adnan Akhundov	06d2392003	Support tt.reduce in Triton kernel analysis pass (#121706 ) Summary: Previously, we bailed out of the Triton kernel analysis pass when seeing a `tt.reduce` op. In this PR, we support the op and don't bail out anymore. Test Plan: This is a bit tricky, as the extension is added to the MLIR walk-based analysis code path which is active only on when the MLIR bindings added in https://github.com/openai/triton/pull/3191 are available. So for now I've run the `test_argmax` and `test_reduce_sum` manually with a newer Triton version than the current pin. When pin updates, we'll make those tests official (left a TODO comment). Pull Request resolved: https://github.com/pytorch/pytorch/pull/121706 Approved by: https://github.com/jansel	2024-03-12 11:38:28 +00:00
Animesh Jain	78b4793c96	[dynamo][compile-time] Caching VTs to reduce compile-time (#121031 ) Reduces the `torch.compile(backend="eager")` for this code ~~~ def fn(x): for _ in range(10000): # x = torch.sin(x) x = torch.ops.aten.sin(x) # x = sin(x) return x ~~~ From 18 seconds to 12 seconds. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121031 Approved by: https://github.com/jansel	2024-03-12 09:19:50 +00:00
lezcano	86a2d67bb9	Simplify guards using info from previous guards (#121463 ) Let me see what CI thinks about this one. Will add tests tomorrow. Fixes https://github.com/pytorch/pytorch/issues/119917 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121463 Approved by: https://github.com/ezyang	2024-03-12 04:22:20 +00:00
Shen Xu	159f30331f	[quant][pt2e] Call sub-quantizers' transform_for_annotation in ComposableQuantizer (#121548 ) Test Plan: ``` buck run caffe2/test:quantization_pt2e ``` Differential Revision: D54454707 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121548 Approved by: https://github.com/jerryzh168	2024-03-12 02:59:12 +00:00
eellison	6ca9ae4f86	Express y grid > 2^16 in terms of z grid (#121554 ) CUDA has a max y_grid of 65535. If we're computing larger than that we can compose it in terms of z grid, which is currently unused in inductor codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121554 Approved by: https://github.com/aakhundov	2024-03-12 02:36:19 +00:00
Jane Xu	fb1d7935bb	[optim][BE] move complex_2d (last of complex tests) to OptimInfo (#120618 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120618 Approved by: https://github.com/albanD	2024-03-12 02:33:21 +00:00
Xinya Zhang	a37e22de70	Add Flash Attention support on ROCM (#121561 ) This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton) - [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`). * MI300X is supported. More architectures will be added once Triton support them. - [x] Only supports power of two sequence lengths. * Now it support arbitrary sequence length - [ ] No support for varlen APIs. * varlen API will be supported in the next release of AOTriton - [x] Only support head dimension 16,32,64,128. * Now it support arbitrary head dimension <= 256 - [x] Performance is still being optimized. * Kernel is selected according to autotune information from Triton. Other improvements from AOTriton include * Allow more flexible Tensor storage layout * More flexible API This is a more extensive fix to #112997 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561 Approved by: https://github.com/malfet, https://github.com/atalman	2024-03-12 01:16:53 +00:00
Kefei Lu	3a5f48d55f	Port remove_split_ops to PT2 pre-grad passes (#121674 ) Summary: For OEMAE, this contributes 14% of the total DPER pass perf gain. Test Plan: Run test cases Run oemae lower benchmark with and with this fix. FLOP/s 29 -> 34. Reviewed By: frank-wei Differential Revision: D54711064 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121674 Approved by: https://github.com/frank-wei	2024-03-12 01:15:19 +00:00
Elias Ellison	5b5d423c2e	Benchmark templates (#118880 ) Adding support for benchmarking templates in `benchmark_fusion` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118880 Approved by: https://github.com/shunting314	2024-03-11 23:55:13 +00:00
Mu-Chu Lee	7676433012	[AOTInductor] Reuse generated kernels between constant graph and main graph (#121564 ) Summary: We copy the src_to_kernel from constant graph to main graph so that we could avoid generating duplicating kernels. And pass throught the name counter such that no duplicated names will be generated. Test Plan: Included in commit Differential Revision: D54706767 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121564 Approved by: https://github.com/desertfire, https://github.com/chenyang78	2024-03-11 22:44:38 +00:00
Andrew Gu	272cf29e4d	[FSDP2][BE] Refactored `check_1d_sharded_parity` to use mesh (#121357 ) Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357 Approved by: https://github.com/weifengpy ghstack dependencies: #121360	2024-03-11 22:34:42 +00:00
PyTorch MergeBot	fd0dbcd891	Revert "Batch Norm Consolidation (#116092 )" This reverts commit `7b4f70eda5`. Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))	2024-03-11 22:22:41 +00:00
PyTorch MergeBot	b2f09c1859	Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681 )" This reverts commit `d27509c384`. Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/xmfan due to breaking internal builds, see D54707287 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1989542344))	2024-03-11 22:18:36 +00:00
Alexander Grund	d1f45a93af	Check for releasing GIL at compiletime (#116695 ) Introduce `conditional_gil_scoped_release` and use it in `wrap_pybind_function*` to avoid a runtime branch making the code cleaner and faster. @albanD This is the GIL change extracted from #112607 as discussed. Also fixes a potential use of a moved-from object introduced in #116560: - `f` is captured by value in a lambda that may be used called times - After `std::move(f)` the lambda is not safe to call anymore CC @cyyever for that change Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116695 Approved by: https://github.com/albanD, https://github.com/Skylion007	2024-03-11 22:04:56 +00:00
Sam Larsen	fd13a56f61	Refactor some testing helpers for FX graph cache testing (#121520 ) Summary: I plan to enable the FX graph cache for more inductor unit tests. This PR does some refactoring to prepare by moving the `TestCase` base class to `torch._inductor.test_case` (which mirrors the existing `torch._dynamo.test_case`). In a subsequent diff, I'll modify tests importing `torch._dynamo.test_case.TestCase` to use `torch._inductor.test_case.TestCase` instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121520 Approved by: https://github.com/eellison	2024-03-11 21:46:27 +00:00
Kefei Lu	fc712311ce	port fuse_parallel_linear (without changing weights) to PT2 pre-grad (#121617 ) Summary: Does not change weights structure so compatible with const folding and realtime weights update Test Plan: run added test cases Reviewed By: frank-wei Differential Revision: D53843428 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121617 Approved by: https://github.com/frank-wei	2024-03-11 20:51:11 +00:00
Zhenghao Zhao	3461404869	[pt2 export]fix name collision on constant name (#121145 ) Summary: Taking the right most part of the fqn will cause name conflict when having multiple instances of the same class. Changed to replace "." in fqn by "_" to avoid invalid syntax in input args. Test Plan: added test case Differential Revision: D54435230 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121145 Approved by: https://github.com/zhxchen17	2024-03-11 20:40:59 +00:00
Jason Ansel	9aa3fedb75	Slightly faster FX graph iterator (#121611 ) Before: ``` iterating over 100000000 FX nodes took 5.9s (16830686 nodes/s) ``` After: ``` iterating over 100000000 FX nodes took 5.0s (19937698 nodes/s) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121611 Approved by: https://github.com/oulgen	2024-03-11 20:00:19 +00:00
Daniel Herrera	dccc1ca839	[torch] Use __prepare_scriptable__ for closures (#121553 ) Summary: This fixes a case left incomplete by https://github.com/pytorch/pytorch/pull/106229 The object is using __prepare_scriptable__ correctly inside of torch.jit.script() but the clousre that is obtained below is using the non-prepared version. This causes issues when the prepared and non-prepared versions are in different python modules. Test Plan: ``` buck2 run mode/opt caffe2/test:jit -- -r test_decorator ``` Differential Revision: D54308741 Re-exporting, as #120806 #121307 were not properly merged. Co-authored-by: Daniel Herrera <dherrera@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121553 Approved by: https://github.com/huydhn, https://github.com/seemethere	2024-03-11 19:14:19 +00:00
Nikita Shulga	e29004615f	Add NEON accelerated torch.mv kernel (#119992 ) This reduces `torch.mv` time for 256x768 matrix by 256 element vector from 209 usec to 16 usec for nontransposed case and from 104 to 18 usec if transposed Also, add fp16-accumulation flavor to the same ops (controlled by private `torch._C._set_cpu_allow_fp16_reduced_precision_reduction` which yields a slightly better numbers), summarized in the following table \| op \| original \| F32+NEON \| F16+NEON\| \| ---\| -------- \| ---------- \| ----- \| \| torch.mv(m, v) \| 209.53 usec \| 16.25 usec \| 14.68 usec \| \| torch.mv(m.t(), v) \| 104.80 usec \| 28.68 usec \| 24.82 usec \| Test plan: CI on MacOS for both CPU and MPS test fp32<->fp16 matmul consistency ( For example "test_output_grad_match_nn_functional_linear_cpu_float16" passes if fp32-reductions are performed, but fails if fp16 accumulation is used) To investigate: - why replacing `sum0Vec = vaddq_f32(sum0Vec, vmulq_f32(a0Vec, xVec));` with `sum0Vec = vfmaq_f32(sum0Vec, a0Vec, xVec);` slows down gemv from 16.2 to 18.2 usec Pull Request resolved: https://github.com/pytorch/pytorch/pull/119992 Approved by: https://github.com/mikekgfb	2024-03-11 16:00:01 +00:00
Thiago Crepaldi	6c11d3ce0c	Add support to save safetensors checkpoint directly into onnx (#121001 ) Currently, when `torch.onnx.dynamo_export` is called within `torch.onnx.enable_fake_mode`, all the external pytorch checkpoint files used to initialize the model are automatically and used by `torch.onnx.ONNXProgram.save` to recreate the initializers for the newly exported ONNX model. This API extends the mechanism for HuggingFace models that use safetensors weights. This PR detects safetensors state files and converts them to PyTorch format using mmap on a temporary file, which is deleted after conversion is finished. Without this PR, the user would have to convert the safetensors files to pytorch format manually and feed it to `torch.onnx.ONNXProgram.save` manually. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121001 Approved by: https://github.com/BowenBao, https://github.com/malfet	2024-03-11 15:21:59 +00:00
FFFrog	485f8ebc07	add __repr__ function to FunctionSchema for Python (#121484 ) Fixes #118566 Unlike OpOverload or OpOverloadPacket, there is a lot of complex information in the schema, so for me keeping it as is is probably a good choice, but in theory the \_\_repr__ function should show the class name as well as some other key information. If you have any choices, please show me, thank you. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121484 Approved by: https://github.com/Skylion007	2024-03-11 15:16:50 +00:00
Xilun Wu	605c0a28aa	[dtensor][debug] force visualize_sharding not to print for empty tensors (#121217 ) Summary Current `visualize_sharding` code cannot print for empty DTensor objects which leads to an exception. This PR skips the print logic if the DTensor passed in has 0 element. <img width="2165" alt="Pasted Graphic" src="https://github.com/pytorch/pytorch/assets/12968408/fa40b5e7-dad7-4d3a-a485-6a18067320ff"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121217 Approved by: https://github.com/wanchaol ghstack dependencies: #121385, #121382	2024-03-11 09:22:49 +00:00
Xilun Wu	3a5ab17bdc	[dtensor][debug] visualize_sharding skip if the current rank is not in mesh (#121382 ) Summary We should skip the `visualize_sharding()` function on those ranks that are not a part of the DTensor's mesh. If not, exception will be thrown in current visualize logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121382 Approved by: https://github.com/wanchaol ghstack dependencies: #121385	2024-03-11 09:22:49 +00:00
Xilun Wu	b383123e37	[dtensor][debug] visualize_sharding only compute offset on the first rank in mesh (#121385 ) Summary avoid computing on ranks where we do not plan to visualize the DTensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121385 Approved by: https://github.com/wanchaol	2024-03-11 09:22:31 +00:00
kungyork	9c50ecc84b	Fix `get_rank` under a non-default group. (#120481 ) Fixes #120213 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120481 Approved by: https://github.com/yifuwang	2024-03-11 05:40:54 +00:00
Jason Ansel	7cc476ea16	[dynamo] Fix support for nn.Parameter constructor (part 1) (#120163 ) This captures calls to `torch.nn.Parameter` by lifting them to graph inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120163 Approved by: https://github.com/albanD, https://github.com/yanboliang ghstack dependencies: #121086	2024-03-11 05:14:42 +00:00
Jason Ansel	32488b0664	[dynamo] Support _unsafe_set_version_counter (#121086 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121086 Approved by: https://github.com/yanboliang	2024-03-11 05:14:42 +00:00
Ze Sheng	7a4e451184	[Dynamo] Fix function overrides (#120885 ) To check existence of `__torch_function__`, the code intended to iterate each element but got `TupleVariable` when the ordinary `has_torch_function()` was being used. Needs further unpack in this case Fixes #120653 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120885 Approved by: https://github.com/yanboliang	2024-03-11 02:18:43 +00:00
Kefei Lu	f11f2b0d55	split predispatch pass into multiple passes (#121592 ) Summary: It's very difficult to debug the passes ineffectiveness, with them mingled in one single pass container. Here we extract them into seperate passes with diagnostics info. This is also required for a later change, where we must run shape prop on each of these passes, in order for the subsequent passes to have the correct shape information. Reviewed By: frank-wei Differential Revision: D53579545 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121592 Approved by: https://github.com/frank-wei	2024-03-11 00:30:55 +00:00
Avik Chaudhuri	13e8181b7b	relax assertion on fake shape (#121599 ) Summary: Seems like if you use `capture_pre_autograd_graph` fake tensor shapes can be ints instead of symints. Test Plan: fixes the AssertionError in N5057219 Differential Revision: D54729142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121599 Approved by: https://github.com/angelayi, https://github.com/BoyuanFeng	2024-03-10 22:51:10 +00:00
Oguz Ulgen	660ec3d38d	[Export] Fix bug removing node from wrong graph (#121574 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121574 Approved by: https://github.com/ydwu4	2024-03-10 04:46:11 +00:00
Yifu Wang	41286f1505	[IntraNodeComm] fix a hybridCubeMeshAllReduceKernel breakage caused by a recent refactor (#121575 ) `hybridCubeMeshAllReduceKernel` uses the latter half of p2p buffers as relay buffers. The relay buffer address is calculated using a bf16 base pointer and the buffer size in byte. The breakage was caused by not taking element size into account. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121575 Approved by: https://github.com/Chillee	2024-03-10 00:55:25 +00:00

1 2 3 4 5 ...

36373 Commits