Commit Graph

85 Commits

Sheng Qin
d25617255c Fix AOTI update_constant_buffer issue. (#149243)
Summary:
In D69553929 we changed the logic of constant & buffer updates in AOTI. However, this is incompatible with the current Sigmoid runtime, since we have different logic for passing in buffers, resulting in errors like:
```
I0310 17:29:24.456960 3679102 AOTIDelegateExecutor.cpp:89] AOTIDelegateExecutor processing weights
*** Aborted at 1741652964 (Unix time, try 'date -d 1741652964') ***
*** Signal 11 (SIGSEGV) (0x30) received by PID 3679102 (pthread TID 0x7f9933e49000) (linux TID 3679102) (code: address not mapped to object), stack trace: ***
    @ 00000000000040b9 folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/debugging/symbolizer/SignalHandler.cpp:453
    @ 0000000000006c45 folly::fibers::(anonymous namespace)::sigsegvSignalHandler(int, siginfo_t*, void*)
                       ./fbcode/folly/fibers/GuardPageAllocator.cpp:237
    @ 000000000004455f (unknown)
                       /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/libc_sigaction.c:8
                       -> /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c
    @ 00000000001e8164 torch::aot_inductor::AOTInductorModelContainer::update_constant_buffer(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, AtenTensorOpaque*, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AtenTensorOpaque*> > > const&, bool, bool)
```

Test Plan:
1) Generate lowered merge net
```
CUDA_VISIBLE_DEVICES=0 ../buck-out/v2/gen/fbcode/b5b13003c82cbdec/caffe2/torch/fb/model_transform/fx2trt/packaging/__generate_merge_net_file__/generate_merge_net_file.par  --action=generate --input-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_input --output-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --lower-backend=aot_inductor  --use_sigmoid=true --aot_inductor_config="{'max_autotune': True, 'comprehensive_padding': False}" --add_passes=use_matmul_lce_replace_normal_LCE,use_triton_dot_compress,use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction --disable_acc_tracer=false
```

2) Load net predictor
```
CUDA_VISIBLE_DEVICES=1 ../buck-out/v2/gen/fbcode/103717df3cc2b97a/caffe2/torch/fb/model_transform/fx2trt/packaging/__load_net_predictor__/load_net_predictor --loadMode=AccuracyAB --inputNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_ts --otherNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --moduleName=merge --benchmarkEnableProfiling=false --predictor_hardware_type=1 --disableStaticRuntime=true
```

Reviewed By: hl475

Differential Revision: D71236710

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149243
Approved by: https://github.com/hl475, https://github.com/jingsh
2025-03-17 22:10:57 +00:00
Bin Bao
bdf57fb8f7 [AOTI][refactor] Split MiniArrayRef into a separate header (#149073)
Summary: MiniArrayRef is a common utility and will be used by the libtorch-free AOTI.

Differential Revision: [D71064657](https://our.internmc.facebook.com/intern/diff/D71064657)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149073
Approved by: https://github.com/yushangdi
2025-03-13 11:57:32 +00:00
Joel Schlosser
85467ed063 Fix for AOTI + CUDAGraphs when calling from Python (#148601)
**Background**: I've been comparing performance of torch.compile vs. torch.export + AOTI (specifically, loaded from Python) on the Flux model and found a ~1.4% performance decrease with the latter. The trace shows that CUDAGraphs are not utilized for torch.export + AOTI, leading to higher overhead.

When trying to manually CUDAGraph the loaded, previously exported + AOTIed model (thanks to @eellison for the logic here), I get:
```
Error: operation not permitted when stream is capturing
```

@desertfire confirms that this is due to multi-threading logic on the AOTI runtime side (in `AOTIModelContainer` / `AOTIModel`) conflicting with the use of CUDAGraphs.

**Fix**: This PR takes the approach of providing an alternate, single-threaded method for running loaded models with the AOTI runtime. Details:
* Python side introduces a new flag to enable this behavior (needs a better name): `torch._inductor.package.load_package(..., run_single_threaded=False)`
    * This flag is passed down to the C++ side's `AOTIModelPackageLoader`, which passes it to the `CreateAOTIModelRunnerFunc` during `AOTIModelContainerRunner` construction.
* C++ side introduces single-threaded alternatives to model running and model container running:
    * `AOTIModelContainer.run_single_threaded()` / `AOTIModel.run_single_threaded()`. The interfaces match those of `run()`, but the synchronization logic has been removed.
    * Introduces `AOTInductorModelContainerRunSingleThreaded` to AOTI's `interface.h`; this is invoked by the `AOTIModelContainerRunner` utility class when `run_single_threaded=true`.

I've verified on both a small repro and my real-world use case that I can manually CUDAGraph a loaded model that was previously exported + AOTIed.
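
For concreteness, a minimal sketch of this flow (hypothetical package path and shapes; the `run_single_threaded` flag is the one this PR introduces):
```python
import torch
from torch._inductor.package import load_package

# Load the exported + AOTI-compiled package with the runtime's cross-thread
# synchronization disabled, so stream capture is permitted.
compiled = load_package("flux.pt2", run_single_threaded=True)

static_inp = torch.randn(1, 64, device="cuda")

# Warm up on a side stream before capture, per the standard CUDA graph recipe.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        compiled(static_inp)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = compiled(static_inp)

# Replay with new data written into the static input buffer.
static_inp.copy_(torch.randn(1, 64, device="cuda"))
g.replay()  # results land in static_out
```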

**Future work:**
* Flip default value to `run_single_threaded=True` as Python-side inference doesn't take advantage of the AOTI runtime thread pool
    * There are some BC concerns here - models need to be re-serialized so the .so contains the new `AOTInductorModelContainerRunSingleThreaded` interface func. We can flip the default value and warn (instead of crashing) if the `AOTInductorModelContainerRunSingleThreaded` symbol does not exist.
* Compose with cudagraph trees as opposed to manual cuda graph wrapping

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148601
Approved by: https://github.com/desertfire
2025-03-08 02:44:14 +00:00
Zhuoran Zhao
dfb4094b9c Skip buffer in dense update (#148533)
Summary:
as title.

PyTorch module buffers will not be published in delta publishing. In Quinn's previous diff, constant type annotations were introduced.

In addition to skipping constants, we also need to skip a buffer if it is not found in the user-provided list of delta weights.

Test Plan: https://docs.google.com/document/d/1wiqUo0PyZ4g6YJIJlL_LE084ZEuE74iu74gZjqGGjWY/edit?tab=t.0#heading=h.dby6cwiw1xrn

Differential Revision: D69553929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148533
Approved by: https://github.com/22quinn, https://github.com/jingsh
2025-03-07 01:59:58 +00:00
Benjamin Glass
b160dda743 cpp_wrapper: reduce memory usage by removing unneeded temporaries (#147403)
This PR contains a set of interrelated changes, listed below, with the upshot that compiled model memory usage in `cpp_wrapper` mode is now roughly equivalent to that of the default inductor mode.

Changes:

1. Refactor `reinterpret_view` calls in `cpp_wrapper` to always return a temporary RAII tensor object, rather than saving off a "temporary" tensor handle that persisted through the end of the function. This matches the behavior of the base Python wrapper class, and is responsible for the majority of the memory usage reductions.
2. Eliminate nearly all other cases where a "temporary" tensor handle was saved off (with the exception of one or two places where the tensor would immediately be destroyed by going out-of-scope). This necessitated some ugly-looking code to handle `Optional[Tensor]` and `Optional[Sequence[Any]]`, since `Optional` is passed by pointer into the C-shim functions (making passing temporary objects difficult). This code is justified by the fact that it only appears in controlled circumstances that we auto-generate, so there are minimal user-facing footguns.
3. Delete the list containing the input tensors to the `cpp_wrapper` main function after casting them to `AtenTensorHandle` objects, which have an internal reference count keeping them alive.

The [TorchInductor benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Sat%2C%2015%20Feb%202025%2018%3A38%3A08%20GMT&stopTime=Sat%2C%2022%20Feb%202025%2018%3A38%3A08%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/benjaminglass1/73/head&lCommit=4d5edaf67e80ca9ca36d301af1ded13967a04790&rBranch=main&rCommit=e1bf892d9004a4dba0748d0eda5c3b4eced0ea70) I ran shows the increased memory compression.

Differential Revision: [D70648897](https://our.internmc.facebook.com/intern/diff/D70648897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147403
Approved by: https://github.com/desertfire
2025-03-06 16:08:16 +00:00
Mu-Chu Lee
a5c0dab900 [AOTInductor] Guard RAII_cpuMalloc with macro (#147150)
Summary: Silence the "RAII_cpuMalloc(size_t) defined but not used [-Wunused-function]" warning.

Test Plan: Existing tests

Differential Revision: D69623481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147150
Approved by: https://github.com/henrylhtsang
2025-02-14 23:21:35 +00:00
Mu-Chu Lee
e21181642f [AOTInductor] Align behavior between CPU and GPU (#145459)
Summary:
(1) Make sure CPU and GPU do not have different implementations and behavior when called from the same path and API. After this PR, the only difference between CPU and GPU should be the hardware they run on.
(2) This PR fixes a memory-access issue when it == constants_map.end().
(3) This PR resolves T179437596.

Test Plan: buck2 run mode/dev sigmoid/inference/test:e2e_test_cpu

Differential Revision: D68540744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145459
Approved by: https://github.com/desertfire, https://github.com/hl475
2025-02-13 09:50:18 +00:00
Davide Italiano
d80eef7c6d [inductor] Guard a member variable with a define. (#146278)
It's unused otherwise, and when running MPS tests, I get a bunch of warnings of this kind:

```
/Users/davidino/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:412:10: warning: private field 'blob_size_' is not used [-Wunused-private-field]
  412 |   size_t blob_size_;
      |          ^
1 warning generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146278
Approved by: https://github.com/Skylion007, https://github.com/jansel
2025-02-03 02:20:08 +00:00
Bin Bao
16420a78eb [AOTI] Remove AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 (#146039)
Summary: The AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 macro was used to solve an FC (forward compatibility) issue, and it can be removed now.

Test Plan: CI

Differential Revision: D68871245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146039
Approved by: https://github.com/yushangdi, https://github.com/hl475
2025-01-30 19:01:19 +00:00
Mu-Chu Lee
ac87388e61 [AOTInductor] Refactor CPU and GPU to remove ifdef macros (#145639)
Summary: Remove #ifdef USE_CUDA macros through refactoring.

Test Plan: Refactor code, existing tests.

Differential Revision: D68636743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145639
Approved by: https://github.com/desertfire
2025-01-28 16:46:00 +00:00
Benjamin Glass
d5629889f1 cpp_wrapper: Properly handle scalars when input to tensor arguments (#144910)
Additionally, reduce code duplication in `cpp_wrapper_cpu_array_ref.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144910
Approved by: https://github.com/desertfire
2025-01-24 02:06:35 +00:00
Chong Gu
5b5d7016c8 Remove stable_partition for ARM AOTI Runtimes (#142394)
Summary: This function call causes OOM issues on ARM machines with multi-threaded predictors (the reason behind this is still being investigated); we replace it with the standard partition instead.

Differential Revision: D66904296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142394
Approved by: https://github.com/frank-wei
2024-12-17 19:19:04 +00:00
Bin Bao
6680a83e89 [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)
This PR adds XPU support for AOT Inductor and reuses the corresponding UTs.
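
As a usage illustration, a minimal sketch assuming an XPU-enabled PyTorch build (the toy module and shapes are hypothetical):
```python
import torch
from torch._export import aot_compile, aot_load

# Hypothetical toy module; the point is only the "xpu" device placement.
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(8, 8)

    def forward(self, x):
        return self.fc(x)

mod = M().to("xpu")
so_path = aot_compile(mod, (torch.randn(4, 8, device="xpu"),))
model = aot_load(so_path, device="xpu")
out = model(torch.randn(4, 8, device="xpu"))
```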

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268

Co-authored-by: Bin Bao <binbao@meta.com>
2024-12-10 05:05:08 +00:00
Scott Wolchok
274223d719 Add and use borrow_arrayref_tensor_as_tensor (#142183)
Differential Revision: [D66847773](https://our.internmc.facebook.com/intern/diff/D66847773/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142183
Approved by: https://github.com/desertfire, https://github.com/hl475
ghstack dependencies: #142340, #142182
2024-12-09 22:23:21 +00:00
Scott Wolchok
18d25aa7aa Rename convert_arrayref_tensor_to_tensor to copy_arrayref_tensor_to_tensor (#142182)
Be explicit about what we are doing, in preparation for adding borrow_arrayref_tensor_as_tensor.

Differential Revision: [D66847772](https://our.internmc.facebook.com/intern/diff/D66847772/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142182
Approved by: https://github.com/desertfire
ghstack dependencies: #142340
2024-12-09 22:23:21 +00:00
PyTorch MergeBot
219e9c83a5 Revert "[AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)"
This reverts commit 854d83133b.

Reverted https://github.com/pytorch/pytorch/pull/140269 on behalf of https://github.com/clee2000 due to breaks forward compatibility?  D66937097 ([comment](https://github.com/pytorch/pytorch/pull/140269#issuecomment-2528828555))
2024-12-09 17:33:28 +00:00
xinan.lin
854d83133b [AOTI XPU] Support AOT Inductor for Intel GPU. (#140269)
This PR adds XPU support for AOT Inductor and reuses the corresponding UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140269
Approved by: https://github.com/desertfire, https://github.com/EikanWang
ghstack dependencies: #140268
2024-12-07 19:22:04 +00:00
Mu-Chu Lee
b08bc07cd7 [AOTInductor] Option to not include weight in .so (#141997)
Summary: Add an option in config to not include weights in .so

Test Plan: `test/inductor:test_aot_inductor -- -r test_so_without_weight_cuda`

Reviewed By: desertfire

Differential Revision: D65968885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141997
Approved by: https://github.com/desertfire
2024-12-05 03:35:54 +00:00
Benjamin Glass
54adbbf6b8 cpp_wrapper: Add support for MemoryFormat arguments (#141367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141367
Approved by: https://github.com/desertfire
2024-12-02 20:40:24 +00:00
xinan.lin
4742080ed9 [AOTI XPU] Enable Cpp wraper for Intel GPU. (#135318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135318
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
2024-11-26 11:51:32 +00:00
Bin Bao
dcf22fa58c [AOTI][refactor] Add sizes and strides util functions (#140449)
Summary: Similar to https://github.com/pytorch/pytorch/pull/139895, add sizes and strides methods to RAIIAtenTensorHandle and ConstantHandle to improve code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140449
Approved by: https://github.com/chenyang78
ghstack dependencies: #140447, #140448
2024-11-14 16:48:43 +00:00
Bin Bao
80870f62f0 [AOTI][refactor] Switch remaining aoti_torch_get_data_ptr (#140448)
Summary: https://github.com/pytorch/pytorch/pull/139895 added data_ptr(), but one remaining place in cpp_wrapper_gpu.py didn't switch over. Also moved a few AtenTensorHandle-related utility functions from arrayref_tensor.h to utils.h.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140448
Approved by: https://github.com/chenyang78
ghstack dependencies: #140447
2024-11-14 01:40:59 +00:00
Bin Bao
d0ffd6d142 [AOTI] Add data_ptr to RAIIAtenTensorHandle (#139895)
Summary: To increase the readability of the generated code. This is not BC-breaking, because RAIIAtenTensorHandle is implemented as header-only.

Differential Revision: [D65547216](https://our.internmc.facebook.com/intern/diff/D65547216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139895
Approved by: https://github.com/chenyang78
2024-11-07 01:36:28 +00:00
cyy
7f387fa612 [10/N] Fix extra warnings brought by clang-tidy-17 (#139385)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139385
Approved by: https://github.com/Skylion007
2024-11-04 00:47:19 +00:00
cyyever
456c87c8a2 [8/N] Fix extra warnings brought by clang-tidy-17 (#139151)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139151
Approved by: https://github.com/ezyang

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-30 14:20:08 +00:00
Richard Barnes
42994234a6 std::value/std::type -> std::_v/std::_t (#138746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138746
Approved by: https://github.com/cyyever, https://github.com/malfet
2024-10-26 20:59:24 +00:00
cyy
a05b64a38f [5/N] Fix extra warnings brought by clang-tidy-17 (#138403)
Follows #137983
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138403
Approved by: https://github.com/ezyang
2024-10-21 02:59:54 +00:00
cyy
a2396b2dd8 [2/N] Fix extra warnings brought by clang-tidy-17 (#137459)
Follows #137407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137459
Approved by: https://github.com/Skylion007
2024-10-08 19:05:02 +00:00
Quinn Zhu
d33638588e [aoti][inplace] Support skipping model buffers (#136770)
Summary: Some AOTI tensor constants may be model buffers that never need to be updated.

Differential Revision: D62777502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136770
Approved by: https://github.com/muchulee8
2024-09-30 18:28:42 +00:00
Bin Bao
da5c7b6f4e [AOTI] Set CUDA device for torch._export.aot_load (#136715)
Summary: Fixes https://github.com/pytorch/pytorch/issues/136369. When a CUDA device with an index is specified in the call to torch._export.aot_load, we need to set that CUDA device when running model.so.
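
A minimal sketch of the fixed path (the .so path and shapes are hypothetical):
```python
import torch
from torch._export import aot_load

# Load onto a non-default CUDA device; after this fix, the runtime
# runs the model on cuda:1 as requested rather than the default device.
model = aot_load("/tmp/model.so", device="cuda:1")
x = torch.randn(8, 16, device="cuda:1")
out = model(x)
```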

Differential Revision: [D63438335](https://our.internmc.facebook.com/intern/diff/D63438335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136715
Approved by: https://github.com/angelayi
2024-09-26 22:20:12 +00:00
cyy
bdf57da6a6 [3/N] Enable clang-tidy on torch/csrc/inductor (#132101)
Follows #132040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132101
Approved by: https://github.com/Skylion007
2024-07-30 13:04:57 +00:00
cyy
ab912b7fef [2/N] Fix clang-tidy warnings in inductor (#132040)
Follows #131979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132040
Approved by: https://github.com/Skylion007
2024-07-29 18:41:24 +00:00
Bin Bao
945946e817 [AOTI] Fix another ABI-compatible CPU issue (#131798)
Summary: This problem is seen in AOTI CPU dashboard runs: a cpp compilation error occurs because ConstantHandle::get doesn't exist. This PR adds ConstantHandle::get so that the interface is consistent with RAIIAtenTensorHandle.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131798
Approved by: https://github.com/zou3519, https://github.com/chenyang78
ghstack dependencies: #131791
2024-07-26 11:27:58 +00:00
Ying Zhao
b84036e3fb [AOTI] Fix test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation (#129173)
Fixes #122978
## Summary
Fixes the compilation error for the test test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation.

- Error 1
```
error: no matching function for call to ‘torch::aot_inductor::ArrayRefTensor<float>::ArrayRefTensor(float [1], const int64_t [0], const int64_t [0], int&, int32_t&)’
  613 |     ArrayRefTensor<float> buf3(buf3_storage, int_array_6, int_array_6, cached_torch_device_type_cpu, this->device_idx_);
      |                                                                                                                       ^
...
torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:188:35: note:   no known conversion for argument 2 from ‘const int64_t [0]’ {aka ‘const long int [0]’} to ‘torch::aot_inductor::MiniArrayRef<const long int>’
  188 |       MiniArrayRef<const int64_t> sizes,
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~
```
Fix: added a constructor for empty arrays in arrayref_tensor.h.
- Error 2
```
error: cannot convert ‘torch::aot_inductor::ArrayRefTensor<float>’ to ‘AtenTensorHandle’ {aka ‘AtenTensorOpaque*’}
  625 |     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw));
      |                                                         ^~~~
      |                                                         |
      |                                                         torch::aot_inductor::ArrayRefTensor<float>
```
Fix: in cpp_wrapper_cpu.py, added codegen to convert ArrayRefTensor to AtenTensorHandle first.
## Test Plan
```
python test/inductor/test_aot_inductor.py -k AOTInductorTestABICompatibleCpuWithStackAllocation.test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation
```

Before the fix, as detailed in #122978:
```
 |     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw));
      |                                                         ^~~~
      |                                                         |
      |                                                         torch::aot_inductor::ArrayRefTensor<float>
/home/yingzhaoseattle/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/utils.h:34:8: note: in definition of macro ‘AOTI_TORCH_ERROR_CODE_CHECK’
Ran 1 test in 4.377s
FAILED (errors=1)
```
After the fix

```
/home/yingzhaoseattle/pytorch/torch/backends/cudnn/__init__.py:107: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('extern_calls', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 9.633s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129173
Approved by: https://github.com/chenyang78
2024-06-28 16:27:42 +00:00
Wu, Chunyuan
4a997de8b9 [AOTI] support freezing for MKLDNN (#124350)
## Description
Fixes https://github.com/pytorch/pytorch/issues/114450. This PR builds upon the work from @imzhuhl done in https://github.com/pytorch/pytorch/pull/114451.

This PR requires https://github.com/pytorch/pytorch/pull/122472 to land first.

We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during compilation and restore the opaque tensor when loading the compiled .so.
The ideep version is updated so that we won't break any pipelines even if third_party/ideep is not updated at the same time.
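
As an illustration, a minimal sketch of the freezing flow this enables (the module and shapes are hypothetical):
```python
import torch
import torch._inductor.config as inductor_config
from torch._export import aot_compile

# With freezing on, conv weights are folded into MKLDNN-packed constants,
# which this PR serializes into the compiled .so.
inductor_config.freezing = True

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Conv2d(3, 16, 3)

    def forward(self, x):
        return self.conv(x)

with torch.no_grad():
    so_path = aot_compile(M().eval(), (torch.randn(1, 3, 32, 32),))
```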

### Test plan:
```sh
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu
```

### TODOs in follow-up PRs
1. We found that using `AOTI_TORCH_CHECK` causes a performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor, which uses `TORCH_CHECK`. How to address this may need further discussion (`AOTI_TORCH_CHECK` was introduced in https://github.com/pytorch/pytorch/pull/119220).
2. Freezing in non-ABI-compatible mode will work with the support in this PR. For ABI-compatible mode, we first need to address this issue: `AssertionError: None, i.e. optional output is not supported`.
6c4f43f826/torch/_inductor/codegen/cpp_wrapper_cpu.py (L2023-L2024)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124350
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-05-25 07:15:36 +00:00
PyTorch MergeBot
5ae9daa4a2 Revert "[AOTI] support freezing for MKLDNN (#124350)"
This reverts commit 654afb6f3a.

Reverted https://github.com/pytorch/pytorch/pull/124350 on behalf of https://github.com/clee2000 due to Seems to have broken inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_freezing_non_abi_compatible_cpu 654afb6f3a https://github.com/pytorch/pytorch/actions/runs/9224838183/job/25382780192 ([comment](https://github.com/pytorch/pytorch/pull/124350#issuecomment-2129889809))
2024-05-24 16:03:07 +00:00
Wu, Chunyuan
654afb6f3a [AOTI] support freezing for MKLDNN (#124350)
## Description
Fixes https://github.com/pytorch/pytorch/issues/114450. This PR builds upon the work from @imzhuhl done in https://github.com/pytorch/pytorch/pull/114451.

This PR requires https://github.com/pytorch/pytorch/pull/122472 to land first.

We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during compilation and restore the opaque tensor when loading the compiled .so.
The ideep version is updated so that we won't break any pipelines even if third_party/ideep is not updated at the same time.

### Test plan:
```sh
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu
```

### TODOs in follow-up PRs
1. We found that using `AOTI_TORCH_CHECK` causes a performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor, which uses `TORCH_CHECK`. How to address this may need further discussion (`AOTI_TORCH_CHECK` was introduced in https://github.com/pytorch/pytorch/pull/119220).
2. Freezing in non-ABI-compatible mode will work with the support in this PR. For ABI-compatible mode, we first need to address this issue: `AssertionError: None, i.e. optional output is not supported`.
6c4f43f826/torch/_inductor/codegen/cpp_wrapper_cpu.py (L2023-L2024)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124350
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-05-24 13:34:04 +00:00
Scott Wolchok
3d8b903d95 [PyTorch] Remove ArrayRefTensor::numel_ (#124516)
ArrayRefTensor::numel_ is redundant with the size of the contained MiniArrayRef. Reclaiming the space entirely would break ABI compatibility, but at least we have 4-8 bytes for future expansion.

Differential Revision: [D56366829](https://our.internmc.facebook.com/intern/diff/D56366829/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D56366829/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124516
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-04-20 02:44:20 +00:00
Nikita Shulga
1ba85b34dd [AOTI] Enbale mmaped weights when CUDA is used (#124346)
By refactoring the logic that returns the pointer to the start of the constants into a `_get_constants_start()` method and calling it from both the CUDA and CPU readers.

It has no runtime impact, but export time is down from 10m to 3m if mmapped weights are used on AWS p4d.24xlarge.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124346
Approved by: https://github.com/mikekgfb, https://github.com/desertfire
2024-04-19 04:47:27 +00:00
Nikita Shulga
416f532753 [AOTI] Serialize large weights (#123002)
By appending them to the end of the shared library and mmapping them afterwards.
Disabled by default, but overridable by `config.aot_inductor.force_mmap_weights` (see the sketch after the TODO list below).

Implemented by adding a `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h`, which is defined when weights are appended to the binary. In that case, the shared library name is determined by calling `dladdr`; the weights are mmapped and finally checked against a random magic number embedded at the end of the weights as well as in the const section of the library in question.

Added unit tests to validate that it works as expected.
TODO:
  - Extend support to CUDA
  - munmap region if the same library is reused
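
As a usage illustration, a minimal sketch (hypothetical module and shapes; CPU only, per the TODO above):
```python
import torch
import torch._inductor.config as inductor_config
from torch._export import aot_compile, aot_load

# Force weights to be appended to the .so and mmapped at load time.
inductor_config.aot_inductor.force_mmap_weights = True

mod = torch.nn.Linear(1024, 1024)
so_path = aot_compile(mod, (torch.randn(4, 1024),))
model = aot_load(so_path, device="cpu")
```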

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb
2024-04-11 06:39:58 +00:00
PyTorch MergeBot
a65e9a06f0 Revert "[AOTI] Serialize large weights (#123002)"
This reverts commit 27eb5daee4.

Reverted https://github.com/pytorch/pytorch/pull/123002 on behalf of https://github.com/DanilBaibak due to There is conflict to land the diff internally ([comment](https://github.com/pytorch/pytorch/pull/123002#issuecomment-2048215990))
2024-04-10 18:54:31 +00:00
Nikita Shulga
27eb5daee4 [AOTI] Serialize large weights (#123002)
By appending them to the end of the shared library and mmapping them afterwards.
Disabled by default, but overridable by `config._force_mmap_aoti_weights`.

Implemented by adding a `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h`, which is defined when weights are appended to the binary. In that case, the shared library name is determined by calling `dladdr`; the weights are mmapped and finally checked against a random magic number embedded at the end of the weights as well as in the const section of the library in question.

Added unit tests to validate that it works as expected.
TODO:
  - Extend support to CUDA
  - munmap region if the same library is reused

Co-authored-by: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb
2024-04-09 22:18:57 +00:00
Bin Bao
0ff6155eee [AOTI] Support module buffer mutation (#123164)
Summary: Fixes https://github.com/pytorch/pytorch/issues/120424. Because module buffers may be mutated in a forward pass, we need to allow that in AOTI. In addition, this will be a necessary step if we want to extend AOTI to training.
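
A sketch of the pattern this enables (hypothetical module):
```python
import torch

class Counter(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("step", torch.zeros(()))

    def forward(self, x):
        self.step.add_(1)  # in-place buffer mutation during the forward pass
        return x + self.step
```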

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123164
Approved by: https://github.com/digantdesai, https://github.com/malfet, https://github.com/chenyang78, https://github.com/khabinov
2024-04-02 20:25:26 +00:00
Bin Bao
375a8041ed [AOTI][refactor] Improve logging (#122932)
Summary: Improve some logging messages, and change a data type to remove a compile-time warning.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122932
Approved by: https://github.com/chenyang78
2024-03-29 14:02:23 +00:00
Mu-Chu Lee
091a24495b [AOTInductor] Support use_runtime_constant_folding for CPU. (#122563)
Summary:
We allow CPU to use the config use_runtime_constant_folding (a usage sketch follows the list below).
Changes include:
1. Rearrange USE_CUDA flags. Add CPU sections that consume memory directly.
2. Codegen changes to accommodate cpp fusions for CPU only. Specifically, we shouldn't generate 2 headers that would cause re-declaration.
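
A usage sketch (hypothetical module and shapes; the option key mirrors the config name above):
```python
import torch
from torch._export import aot_compile

mod = torch.nn.Linear(16, 16)
so_path = aot_compile(
    mod,
    (torch.randn(2, 16),),
    # Opt the CPU model into runtime constant folding.
    options={"aot_inductor.use_runtime_constant_folding": True},
)
```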

Test Plan: Activate tests that were deactivated for CPU before.

Reviewed By: khabinov

Differential Revision: D55234300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122563
Approved by: https://github.com/chenyang78
2024-03-28 17:49:05 +00:00
Mu-Chu Lee
2367d0dacd [AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562) (#122690)
Summary:

During tracing, some constants (tensor_constant{idx}) are generated internally.
Those constants are neither parameters nor buffers, and users have no control over them.

To accommodate this, we should allow users not to pass in those internally generated constants while still being able to update the constants in the model.

Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```

Reviewed By: zoranzhao

Differential Revision: D55354548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122690
Approved by: https://github.com/khabinov
2024-03-26 23:25:15 +00:00
PyTorch MergeBot
55f36d1ada Revert "[AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562)"
This reverts commit 57a3d00b06.

Reverted https://github.com/pytorch/pytorch/pull/122562 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/122562#issuecomment-2019262415))
2024-03-26 02:18:19 +00:00
Mu-Chu Lee
57a3d00b06 [AOTInductor] Add tensor_constantX to pass constant buffer update's check (#122562)
Summary:
During tracing, some constants (tensor_constant{idx}) are generated internally.
Those constants are neither parameters nor buffers, and users have no control over them.

To accommodate this, we should allow users not to pass in those internally generated constants while still being able to update the constants in the model.

Test Plan:
Included in commit.
```
build/bin/test_aot_inductor
```

Differential Revision: D55286634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122562
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2024-03-25 22:05:20 +00:00
Bin Bao
7e598c0053 [Inductor] Enable ABI-compatible mode for cpp-wrapper JIT (#121309)
Differential Revision: [D54617284](https://our.internmc.facebook.com/intern/diff/D54617284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121309
Approved by: https://github.com/chenyang78
2024-03-07 14:22:06 +00:00
Bin Bao
fa0e39560c [AOTI] Fix a typo (#120094)
Differential Revision: [D53861810](https://our.internmc.facebook.com/intern/diff/D53861810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120094
Approved by: https://github.com/khabinov, https://github.com/sijiac
2024-02-17 05:28:58 +00:00