Commit Graph

36373 Commits

Author SHA1 Message Date
Zihua Wu
d62bdb087d [Profiler] add missing field device_resource_id (#121480)
Fixes #121479

Co-authored-by: Aaron Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121480
Approved by: https://github.com/aaronenyeshi
2024-03-12 21:42:53 +00:00
PyTorch MergeBot
5b506c8bce Revert "[dynamo][guards] Use lazy variable tracker for func defaults (#121388)"
This reverts commit 04a5d6e8d3.

Reverted https://github.com/pytorch/pytorch/pull/121388 on behalf of https://github.com/osalpekar due to causing executorch model-test failures internally. See [D54707529](https://www.internalfb.com/diff/D54707529) ([comment](https://github.com/pytorch/pytorch/pull/121388#issuecomment-1992619251))
2024-03-12 21:31:18 +00:00
Shunting Zhang
522d972924 [eazy] add more log when accuracy check fail (#121656)
Add these logs to debug the accuracy-test regression of the dm_nfnet_f0 model in training.

With these extra logs, when the accuracy check fails we can verify whether it is close to succeeding. If so, that indicates there is no real issue, just flakiness, and we can probably tune the tolerance to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121656
Approved by: https://github.com/jansel, https://github.com/Skylion007
2024-03-12 20:58:20 +00:00
Manuel Candales
6d8a7d6e58 [pytorch] optional zero points on dequantize per channel (#121724)
Summary:
X-link: https://github.com/pytorch/executorch/pull/2364

bypass-github-export-checks

Test Plan: sandcastle

Reviewed By: mikekgfb

Differential Revision: D54709217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121724
Approved by: https://github.com/mikekgfb
2024-03-12 19:54:11 +00:00
Colin Peppler
a6149eba12 [easy] Refactor MultiOutput. codegen_list_tuple_access to use subclass type checks (#121662)
Summary:
# Why?

Right now I'm running into a case where `itype` is `torch.fx.immutable_collections.immutable_list`, which is a subclass of `list`. However, currently we're checking the concrete types (i.e. `list`), and `immutable_list` isn't explicitly supported here.

Thus, we use a runtime check that looks at the subclass so we can support subclasses -- such as immutable_list -- as well.
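
A minimal illustration of the difference (a standalone sketch, not the actual inductor code):

```
# Exact-type checks reject subclasses such as immutable_list,
# while a subclass-aware check accepts them.
from torch.fx.immutable_collections import immutable_list

itype = immutable_list
print(itype is list)            # False -- concrete type check fails
print(issubclass(itype, list))  # True  -- subclass-aware check passes
```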

Test Plan: ci

Differential Revision: D54764829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121662
Approved by: https://github.com/aakhundov
2024-03-12 19:27:56 +00:00
Tugsbayasgalan Manlaibaatar
90e886aa6c Sanity check for non-strict (#121687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121687
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #121652, #121678
2024-03-12 18:21:32 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar
443e241cc5 Don't cache predispatch kernels (#121712)
Summary: Title

Test Plan: CI

Differential Revision: D54791087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121712
Approved by: https://github.com/ydwu4
2024-03-12 18:05:59 +00:00
Wanchao Liang
a26480a4d1 [dtensor] move early return check into redistribute autograd function (#121653)
This PR fixes a redistribute bug by moving the early-return check into the
redistribute autograd function: even when we redistribute to the same
placement, the grad_placements from the `to_local` call might differ, so the
redistribute backward still needs to happen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121653
Approved by: https://github.com/awgu
2024-03-12 17:37:30 +00:00
Animesh Jain
22489bfe70 [dynamo][guards-cpp-refactor] Directly call root guard manager in eval_frame (#121622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121622
Approved by: https://github.com/jansel
ghstack dependencies: #121614
2024-03-12 17:09:11 +00:00
Animesh Jain
2348e8e4e7 [dynamo][guards-cpp-refactor] Simplify DYNAMIC_INDICES guard (#121614)
Use NO_HASATTR guard for the common part.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121614
Approved by: https://github.com/jansel
2024-03-12 17:08:56 +00:00
PyTorch MergeBot
0398dc9e8e Revert "[DCP] Makes fsspec public (#121508)"
This reverts commit d482614fec.

Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))
2024-03-12 17:02:43 +00:00
angelayi
d1715c3adb [export] Update error message for set_grad (#121666)
Context: https://fb.workplace.com/groups/222849770514616/posts/381979051268353/?comment_id=383334957799429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121666
Approved by: https://github.com/ydwu4
2024-03-12 16:41:45 +00:00
Jason Ansel
3c8c7e2a46 [dynamo] Tweak naming for module hook bw_state (#121609)
Some minor changes not related to the other PRs in the stack

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121609
Approved by: https://github.com/yanboliang
2024-03-12 16:27:56 +00:00
Chien-Chin Huang
7a68e0a3e8 [DCP][state_dict] Remove the check of FSDP has root (#121544)
Root may not exist due to FSDP lazy initialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121544
Approved by: https://github.com/Skylion007
ghstack dependencies: #121273, #121276, #121290
2024-03-12 15:43:19 +00:00
Andrew Gu
85dc254364 [DTensor] Moved Transformer sharding to staticmethod (#121660)
To support FSDP + TP/SP unit tests, let us factor out the canonical TP/SP sharding of `Transformer` to a staticmethod that can be called by other unit tests.

Test Plan:
```
pytest test/distributed/tensor/parallel/test_tp_examples.py -k test_transformer_training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121660
Approved by: https://github.com/wanchaol, https://github.com/yifuwang
ghstack dependencies: #121360, #121357
2024-03-12 15:08:57 +00:00
Howard Huang
2a99e6f299 Update error message (#121644)
Summary:
We don't want people to move to NCCL exp without explicit opt-in. It seems that sparse allreduce was accidentally called and people were confused about whether they should use NCCL exp instead.

Update the error message to explicitly say that sparse_allreduce is not supported.

Test Plan: sandcastle

Differential Revision: D54759307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121644
Approved by: https://github.com/awgu
2024-03-12 13:04:21 +00:00
kausik
edf22f3a48 Modify signature of dequantize ops for decomposed quantized Tensor (#119173) (#121450)
Summary:
X-link: https://github.com/pytorch/executorch/pull/2308

Note: The initial purpose of this PR is to draw suggestion and feedback regarding better alternative, if any.

At present, the dequantize op for the decomposed quantized Tensor representation, e.g. dequantize_per_tensor(), assumes the output dtype is torch.float and hence does not have an output dtype in its operator argument list. However, this op signature becomes unusable when that assumption breaks, because if the output dtype is different from torch.float there is no way to specify it during dequantization.

This change is aimed at generalizing the signature of dequantize ops like dequantize_per_tensor() for wider use-cases where the output dtype can be different from torch.float and needs to be passed during dequantization. The proposal is to use an additional argument named 'output_dtype' to solve the problem. However, we would also like to have suggestions and feedback regarding any better alternative that can be used instead.
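
A hedged sketch of what the generalized signature could look like (argument names, defaults, and the computation below are illustrative assumptions, not the actual op schema):

```
import torch

def dequantize_per_tensor(
    input: torch.Tensor,
    scale: float,
    zero_point: int,
    quant_min: int,            # clamping bounds of the quantized input,
    quant_max: int,            # kept here only for signature parity
    dtype: torch.dtype,        # dtype of the quantized input (e.g. torch.int8)
    output_dtype: torch.dtype = torch.float,  # proposed additional argument
) -> torch.Tensor:
    # Dequantize in fp32, then cast to the requested output dtype.
    return ((input.to(torch.float32) - zero_point) * scale).to(output_dtype)
```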

cc jerryzh168 jianyuh raghuramank100 jamesr66a vkuzo jgong5 Xia-Weiwen leslie-fang-intel

Reviewed By: digantdesai

Differential Revision: D53590486

Pulled By: manuelcandales

Co-authored-by: kausik <kmaiti@habana.ai>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121450
Approved by: https://github.com/jerryzh168
2024-03-12 12:36:31 +00:00
Adnan Akhundov
06d2392003 Support tt.reduce in Triton kernel analysis pass (#121706)
Summary: Previously, we bailed out of the Triton kernel analysis pass when seeing a `tt.reduce` op. In this PR, we support the op and don't bail out anymore.
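
For illustration only (not one of the tests mentioned in the Test Plan), here is a minimal user-defined Triton kernel whose `tl.sum` reduction lowers to a `tt.reduce` op in the Triton IR, i.e. the kind of kernel the analysis pass can now handle:

```
import torch
import triton
import triton.language as tl

@triton.jit
def sum_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    total = tl.sum(x, axis=0)  # lowers to tt.reduce in the Triton IR
    tl.store(out_ptr, total)

x = torch.randn(1000, device="cuda")
out = torch.empty(1, device="cuda")
sum_kernel[(1,)](x, out, x.numel(), BLOCK=1024)
```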

Test Plan: This is a bit tricky, as the extension is added to the MLIR walk-based analysis code path, which is active only when the MLIR bindings added in https://github.com/openai/triton/pull/3191 are available. So for now I've run `test_argmax` and `test_reduce_sum` manually with a newer Triton version than the current pin. When the pin updates, we'll make those tests official (left a TODO comment).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121706
Approved by: https://github.com/jansel
2024-03-12 11:38:28 +00:00
Animesh Jain
78b4793c96 [dynamo][compile-time] Caching VTs to reduce compile-time (#121031)
Reduces the `torch.compile(backend="eager")` compile time for this code

~~~
import torch

def fn(x):
    for _ in range(10000):
        # x = torch.sin(x)
        x = torch.ops.aten.sin(x)
        # x = sin(x)

    return x
~~~

From 18 seconds to 12 seconds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121031
Approved by: https://github.com/jansel
2024-03-12 09:19:50 +00:00
lezcano
86a2d67bb9 Simplify guards using info from previous guards (#121463)
Let me see what CI thinks about this one. Will add tests tomorrow.

Fixes https://github.com/pytorch/pytorch/issues/119917
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121463
Approved by: https://github.com/ezyang
2024-03-12 04:22:20 +00:00
Shen Xu
159f30331f [quant][pt2e] Call sub-quantizers' transform_for_annotation in ComposableQuantizer (#121548)
Test Plan:
```
buck run caffe2/test:quantization_pt2e
```

Differential Revision: D54454707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121548
Approved by: https://github.com/jerryzh168
2024-03-12 02:59:12 +00:00
eellison
6ca9ae4f86 Express y grid > 2^16 in terms of z grid (#121554)
CUDA has a max y_grid of 65535. If we need a y grid larger than that, we can compose it in terms of the z grid, which is currently unused in inductor codegen.
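
An illustrative helper (hypothetical, not inductor's actual codegen) for folding an oversized y extent into the z dimension:

```
MAX_Y_GRID = 65535  # CUDA's maximum grid size along y

def split_y_grid(y_elements: int) -> tuple[int, int]:
    """Return (y_grid, z_grid) with y_grid <= MAX_Y_GRID and
    y_grid * z_grid >= y_elements."""
    if y_elements <= MAX_Y_GRID:
        return y_elements, 1
    z_grid = (y_elements + MAX_Y_GRID - 1) // MAX_Y_GRID
    y_grid = (y_elements + z_grid - 1) // z_grid
    return y_grid, z_grid

print(split_y_grid(70000))   # (35000, 2)
print(split_y_grid(200000))  # (50000, 4)
```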

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121554
Approved by: https://github.com/aakhundov
2024-03-12 02:36:19 +00:00
Jane Xu
fb1d7935bb [optim][BE] move complex_2d (last of complex tests) to OptimInfo (#120618)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120618
Approved by: https://github.com/albanD
2024-03-12 02:33:21 +00:00
Xinya Zhang
a37e22de70 Add Flash Attention support on ROCM (#121561)
This patch addresses the major limitations in our previous [PR #115981](https://github.com/pytorch/pytorch/pull/115981) through the new dedicated repository [AOTriton](https://github.com/ROCm/aotriton)

- [x] Only supports MI200 series GPU (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
    * MI300X is supported. More architectures will be added once Triton supports them.
- [x] Only supports power of two sequence lengths.
    * Now it supports arbitrary sequence lengths
- [ ] No support for varlen APIs.
    * varlen API will be supported in the next release of AOTriton
- [x] Only support head dimension 16,32,64,128.
    * Now it supports arbitrary head dimensions <= 256
- [x] Performance is still being optimized.
    * Kernel is selected according to autotune information from Triton.

Other improvements from AOTriton include
* Allow more flexible Tensor storage layout
* More flexible API

This is a more extensive fix to #112997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121561
Approved by: https://github.com/malfet, https://github.com/atalman
2024-03-12 01:16:53 +00:00
Kefei Lu
3a5f48d55f Port remove_split_ops to PT2 pre-grad passes (#121674)
Summary: For OEMAE, this contributes 14% of the total DPER pass perf gain.

Test Plan:
Run test cases

Run oemae lower benchmark with and without this fix. FLOP/s 29 -> 34.

Reviewed By: frank-wei

Differential Revision: D54711064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121674
Approved by: https://github.com/frank-wei
2024-03-12 01:15:19 +00:00
Elias Ellison
5b5d423c2e Benchmark templates (#118880)
Adding support for benchmarking templates in `benchmark_fusion`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118880
Approved by: https://github.com/shunting314
2024-03-11 23:55:13 +00:00
Mu-Chu Lee
7676433012 [AOTInductor] Reuse generated kernels between constant graph and main graph (#121564)
Summary: We copy the src_to_kernel from the constant graph to the main graph so that we can avoid generating duplicate kernels, and we pass through the name counter so that no duplicated names are generated.

Test Plan: Included in commit

Differential Revision: D54706767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121564
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2024-03-11 22:44:38 +00:00
Andrew Gu
272cf29e4d [FSDP2][BE] Refactored check_1d_sharded_parity to use mesh (#121357)
Eventually, we should just have one unified way to check for parity between a `DTensor`-sharded model and a replicated model. This PR is a small refactor to work toward that. One current gap to use this `check_sharded_parity` function for 2D is that FSDP's `(Shard(0), Shard(0))` layout differs from that of the `DTensor` APIs since FSDP shards on dim-0 after TP shards on dim-0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121357
Approved by: https://github.com/weifengpy
ghstack dependencies: #121360
2024-03-11 22:34:42 +00:00
PyTorch MergeBot
fd0dbcd891 Revert "Batch Norm Consolidation (#116092)"
This reverts commit 7b4f70eda5.

Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))
2024-03-11 22:22:41 +00:00
PyTorch MergeBot
b2f09c1859 Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681)"
This reverts commit d27509c384.

Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/xmfan due to breaking internal builds, see D54707287 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1989542344))
2024-03-11 22:18:36 +00:00
Alexander Grund
d1f45a93af Check for releasing GIL at compiletime (#116695)
Introduce `conditional_gil_scoped_release` and use it in `wrap_pybind_function*` to avoid a runtime branch making the code cleaner and faster.

@albanD This is the GIL change extracted from #112607 as discussed.

Also fixes a potential use of a moved-from object introduced in #116560:
- `f` is captured by value in a lambda that may be called multiple times
- After `std::move(f)` the lambda is not safe to call anymore

CC @cyyever for that change
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116695
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-03-11 22:04:56 +00:00
Sam Larsen
fd13a56f61 Refactor some testing helpers for FX graph cache testing (#121520)
Summary: I plan to enable the FX graph cache for more inductor unit tests. This PR does some refactoring to prepare by moving the `TestCase` base class to `torch._inductor.test_case` (which mirrors the existing `torch._dynamo.test_case`). In a subsequent diff, I'll modify tests importing `torch._dynamo.test_case.TestCase` to use `torch._inductor.test_case.TestCase` instead.
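
A sketch of the intended migration (assuming the new module exposes a `TestCase` class as described above; the test class below is illustrative):

```
# Before: from torch._dynamo.test_case import TestCase
from torch._inductor.test_case import TestCase  # new inductor-specific base class

class MyInductorTest(TestCase):
    def test_trivial(self):
        self.assertEqual(1 + 1, 2)
```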

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121520
Approved by: https://github.com/eellison
2024-03-11 21:46:27 +00:00
Kefei Lu
fc712311ce port fuse_parallel_linear (without changing weights) to PT2 pre-grad (#121617)
Summary: Does not change the weights structure, so it is compatible with const folding and realtime weights updates

Test Plan: run added test cases

Reviewed By: frank-wei

Differential Revision: D53843428

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121617
Approved by: https://github.com/frank-wei
2024-03-11 20:51:11 +00:00
Zhenghao Zhao
3461404869 [pt2 export]fix name collision on constant name (#121145)
Summary: Taking the rightmost part of the fqn causes name conflicts when there are multiple instances of the same class. Changed to replace "." in the fqn with "_" to avoid invalid syntax in input args.
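
A minimal illustration (with hypothetical FQNs) of the renaming described above:

```
fqns = ["encoder.block.weight", "decoder.block.weight"]

rightmost = [fqn.split(".")[-1] for fqn in fqns]     # ['weight', 'weight'] -> collision
sanitized = [fqn.replace(".", "_") for fqn in fqns]  # unique and syntactically valid
print(rightmost, sanitized)
```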

Test Plan: added test case

Differential Revision: D54435230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121145
Approved by: https://github.com/zhxchen17
2024-03-11 20:40:59 +00:00
Jason Ansel
9aa3fedb75 Slightly faster FX graph iterator (#121611)
Before:
```
iterating over 100000000 FX nodes took 5.9s (16830686 nodes/s)
```

After:
```
iterating over 100000000 FX nodes took 5.0s (19937698 nodes/s)
```
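
A small, self-contained way to reproduce this kind of measurement (the traced function and node counts below are illustrative assumptions, not the commit's benchmark):

```
import time
import torch
import torch.fx

def f(x):
    for _ in range(1000):
        x = torch.sin(x)
    return x

gm = torch.fx.symbolic_trace(f)  # the loop unrolls into ~1000 call_function nodes

start = time.perf_counter()
count = 0
for _ in range(1000):
    for node in gm.graph.nodes:  # the iterator this PR speeds up
        count += 1
elapsed = time.perf_counter() - start
print(f"iterating over {count} FX nodes took {elapsed:.1f}s ({count / elapsed:.0f} nodes/s)")
```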

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121611
Approved by: https://github.com/oulgen
2024-03-11 20:00:19 +00:00
Daniel Herrera
dccc1ca839 [torch] Use __prepare_scriptable__ for closures (#121553)
Summary:
This fixes a case left incomplete by https://github.com/pytorch/pytorch/pull/106229
The object is using __prepare_scriptable__ correctly inside of torch.jit.script()
but the closure that is obtained below is using the non-prepared version.
This causes issues when the prepared and non-prepared versions are in different python modules.

Test Plan:
```
buck2 run mode/opt caffe2/test:jit -- -r test_decorator
```

Differential Revision: D54308741

Re-exporting, as #120806 #121307 were not properly merged.

Co-authored-by: Daniel Herrera <dherrera@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121553
Approved by: https://github.com/huydhn, https://github.com/seemethere
2024-03-11 19:14:19 +00:00
Nikita Shulga
e29004615f Add NEON accelerated torch.mv kernel (#119992)
This reduces `torch.mv` time for a 256x768 matrix by a 256-element vector from 209 usec to 16 usec for the non-transposed case, and from 104 to 18 usec if transposed

Also, add an fp16-accumulation flavor to the same ops (controlled by the private `torch._C._set_cpu_allow_fp16_reduced_precision_reduction`, which yields slightly better numbers), summarized in the following table

| op | original | F32+NEON | F16+NEON|
| ---| -------- | ---------- | ----- |
| torch.mv(m, v) | 209.53 usec | 16.25 usec | 14.68 usec |
| torch.mv(m.t(), v) |  104.80 usec | 28.68 usec | 24.82 usec |
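
A rough way to time these cases (shapes, dtypes, and vector sizes below are illustrative assumptions, not the commit's benchmark script):

```
import torch
from torch.utils import benchmark

m = torch.rand(256, 768)
v = torch.rand(768)    # conforms with torch.mv(m, v)
vt = torch.rand(256)   # conforms with torch.mv(m.t(), vt)

for label, stmt, extra in [
    ("torch.mv(m, v)", "torch.mv(m, v)", {"m": m, "v": v}),
    ("torch.mv(m.t(), v)", "torch.mv(m.t(), vt)", {"m": m, "vt": vt}),
]:
    t = benchmark.Timer(stmt=stmt, globals={"torch": torch, **extra})
    print(label, f"{t.blocked_autorange().median * 1e6:.2f} usec")
```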

Test plan: CI on macOS for both CPU and MPS tests of fp32<->fp16 matmul consistency (for example, "test_output_grad_match_nn_functional_linear_cpu_float16" passes if fp32 reductions are performed, but fails if fp16 accumulation is used)

To investigate:
 - why replacing `sum0Vec = vaddq_f32(sum0Vec, vmulq_f32(a0Vec, xVec));` with `sum0Vec = vfmaq_f32(sum0Vec, a0Vec, xVec);` slows down gemv from 16.2 to 18.2 usec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119992
Approved by: https://github.com/mikekgfb
2024-03-11 16:00:01 +00:00
Thiago Crepaldi
6c11d3ce0c Add support to save safetensors checkpoint directly into onnx (#121001)
Currently, when `torch.onnx.dynamo_export` is called within `torch.onnx.enable_fake_mode`, all the external PyTorch checkpoint files used to initialize the model are automatically detected and used by `torch.onnx.ONNXProgram.save` to recreate the initializers for the newly exported ONNX model.

This API extends the mechanism for HuggingFace models that use safetensors weights. This PR detects safetensors state files and converts them to PyTorch format using mmap on a temporary file, which is deleted after conversion is finished.

Without this PR, the user would have to manually convert the safetensors files to PyTorch format and feed them to `torch.onnx.ONNXProgram.save`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121001
Approved by: https://github.com/BowenBao, https://github.com/malfet
2024-03-11 15:21:59 +00:00
FFFrog
485f8ebc07 add __repr__ function to FunctionSchema for Python (#121484)
Fixes #118566

Unlike **OpOverload** or **OpOverloadPacket**, there is a lot of complex information in the schema, so for me keeping it as-is is probably a good choice, but in theory the **\_\_repr__** function should show the class name as well as some other key information.

If you have any better alternatives, please share them, thank you.
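
A quick way to see a FunctionSchema repr in practice (the printed format is illustrative; the exact output depends on this PR's **\_\_repr__** implementation):

```
import torch

schema = torch.ops.aten.add.Tensor._schema  # a torch._C.FunctionSchema
print(repr(schema))
# e.g. aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
```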

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121484
Approved by: https://github.com/Skylion007
2024-03-11 15:16:50 +00:00
Xilun Wu
605c0a28aa [dtensor][debug] force visualize_sharding not to print for empty tensors (#121217)
**Summary**
The current `visualize_sharding` code cannot print empty DTensor objects, which leads to an exception. This PR skips the print logic if the DTensor passed in has 0 elements.
[Screenshot: https://github.com/pytorch/pytorch/assets/12968408/fa40b5e7-dad7-4d3a-a485-6a18067320ff]

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121217
Approved by: https://github.com/wanchaol
ghstack dependencies: #121385, #121382
2024-03-11 09:22:49 +00:00
Xilun Wu
3a5ab17bdc [dtensor][debug] visualize_sharding skip if the current rank is not in mesh (#121382)
**Summary**
We should skip the `visualize_sharding()` function on ranks that are not a part of the DTensor's mesh. Otherwise, an exception will be thrown in the current visualization logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121382
Approved by: https://github.com/wanchaol
ghstack dependencies: #121385
2024-03-11 09:22:49 +00:00
Xilun Wu
b383123e37 [dtensor][debug] visualize_sharding only compute offset on the first rank in mesh (#121385)
**Summary**
Avoid computing offsets on ranks where we do not plan to visualize the DTensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121385
Approved by: https://github.com/wanchaol
2024-03-11 09:22:31 +00:00
kungyork
9c50ecc84b Fix get_rank under a non-default group. (#120481)
Fixes #120213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120481
Approved by: https://github.com/yifuwang
2024-03-11 05:40:54 +00:00
Jason Ansel
7cc476ea16 [dynamo] Fix support for nn.Parameter constructor (part 1) (#120163)
This captures calls to `torch.nn.Parameter` by lifting them to graph inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120163
Approved by: https://github.com/albanD, https://github.com/yanboliang
ghstack dependencies: #121086
2024-03-11 05:14:42 +00:00
Jason Ansel
32488b0664 [dynamo] Support _unsafe_set_version_counter (#121086)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121086
Approved by: https://github.com/yanboliang
2024-03-11 05:14:42 +00:00
Ze Sheng
7a4e451184 [Dynamo] Fix function overrides (#120885)
To check for the existence of `__torch_function__`, the code intended to iterate over each element but got a `TupleVariable` when the ordinary `has_torch_function()` was used. This case needs further unpacking.

Fixes #120653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120885
Approved by: https://github.com/yanboliang
2024-03-11 02:18:43 +00:00
Kefei Lu
f11f2b0d55 split predispatch pass into multiple passes (#121592)
Summary:
It's very difficult to debug the passes' ineffectiveness with them mingled in one single pass container. Here we extract them into separate passes with diagnostics info.

This is also required for a later change, where we must run shape prop on each of these passes, in order for the subsequent passes to have the correct shape information.

Reviewed By: frank-wei

Differential Revision: D53579545

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121592
Approved by: https://github.com/frank-wei
2024-03-11 00:30:55 +00:00
Avik Chaudhuri
13e8181b7b relax assertion on fake shape (#121599)
Summary: It seems that if you use `capture_pre_autograd_graph`, fake tensor shapes can be ints instead of symints.

Test Plan: fixes the AssertionError in N5057219

Differential Revision: D54729142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121599
Approved by: https://github.com/angelayi, https://github.com/BoyuanFeng
2024-03-10 22:51:10 +00:00
Oguz Ulgen
660ec3d38d [Export] Fix bug removing node from wrong graph (#121574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121574
Approved by: https://github.com/ydwu4
2024-03-10 04:46:11 +00:00
Yifu Wang
41286f1505 [IntraNodeComm] fix a hybridCubeMeshAllReduceKernel breakage caused by a recent refactor (#121575)
`hybridCubeMeshAllReduceKernel` uses the latter half of the p2p buffers as relay buffers. The relay buffer address is calculated from a bf16 base pointer and the buffer size in bytes. The breakage was caused by not taking the element size into account.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121575
Approved by: https://github.com/Chillee
2024-03-10 00:55:25 +00:00