csrc

The csrc directory contains all of the code concerned with integration with Python. This is in contrast to lib, which contains the Torch libraries that are Python agnostic. csrc depends on lib, but not vice versa.

There are a number of utilities for easing integration with Python which are worth knowing about; we briefly describe them here. But first, the most important gotchas:

  • DO NOT forget to take out the GIL with pybind11::gil_scoped_acquire before calling Python API or bringing a THPObjectPtr into scope.

  • Make sure you include Python.h first in your header files, before any system headers; otherwise, you will get an error: "_XOPEN_SOURCE" redefined. If you pay attention to warnings, you will see where you need to do this (see the sketch below).
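
For instance, a minimal sketch of the required include order (the header name is hypothetical):

// my_bindings.h -- hypothetical header
#include <Python.h>  // must come first, before any system headers
#include <cstdint>   // standard/system headers afterwards
#include <vector>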

Notes

Note [Storage is not nullptr]

Historically, Torch supported nullptr storage, as a minor optimization to avoid having to allocate a storage object when it would be empty. However, this is actually a confusing special case to deal with, so by and large, PyTorch assumes that, in fact, storage is never nullptr.

One case where this assumption matters is when tracking the CUDA device a tensor is stored in: this information is stored solely in the storage, so if a storage were nullptr, we would lose this information.

Although storage is never nullptr, the data field of c10::StorageImpl may be nullptr. This mostly occurs when we want to pre-allocate an output tensor struct, but then have it be resized and filled with data by some operator: there's no point in allocating data for it in this case!
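
To make this concrete, here is a minimal sketch (untested, written against the public ATen C++ API) of a tensor whose storage exists while its data pointer is still null:

#include <ATen/ATen.h>
#include <iostream>

int main() {
  // A zero-element tensor still carries a storage object...
  at::Tensor out = at::empty({0}, at::kFloat);
  std::cout << "has storage: " << out.has_storage() << "\n";    // true
  // ...but the storage's data field may be nullptr, since no bytes
  // have been allocated yet.
  std::cout << "data is null: " << (out.data_ptr() == nullptr) << "\n";
  // An out-variant operator resizes `out` and allocates its data.
  at::add_out(out, at::ones({3}), at::ones({3}));
  std::cout << "numel after add_out: " << out.numel() << "\n";  // 3
}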

Files

Exceptions.h

Frequently when working with the Python API, you may call a function which returns an error. In this case, we want to return directly to the Python interpreter, so that this exception can be propagated accordingly; however, because the Python API is C-based, what actually will happen is that it will return control to whatever C++ code called it. Similarly, if we raise a C++ exception, prior to returning to the Python interpreter, we must set the Python error flags, so that it turns into a Python exception.

Moreover, when using the following macros, any warnings generated will be converted into Python warnings that can be caught by the user.

Exceptions define helpers for two main cases:

  • For code where you write the Python binding by hand, use HANDLE_TH_ERRORS, END_HANDLE_TH_ERRORS and the exception class python_error. You call them like this:
// Entry point from Python interpreter
PyObject* run(PyObject* arg) {
  HANDLE_TH_ERRORS
  ...
  if (!x) throw python_error();
  // From c10/Exception.h
  TORCH_CHECK(cond, "cond was false here");
  TORCH_WARN("Warning message");
  ...
  END_HANDLE_TH_ERRORS
}

The HANDLE_TH_ERRORS macro will catch all exceptions and convert them into an appropriate Python signal. python_error is a special exception which doesn't contain any info; instead, it says, "An error occurred in the Python API; if you return to the interpreter, Python will raise that exception; nothing else needs to be done."

  • For code that you bind using pybind, HANDLE_TH_ERRORS and END_HANDLE_TH_ERRORS_PYBIND can be used. They work jointly with pybind's error handling to raise PyTorch errors and warnings natively and let pybind handle other errors. They can be used as:
// Function given to the pybind binding
at::Tensor foo(at::Tensor x) {
  HANDLE_TH_ERRORS
  ...
  if (!x) throw python_error();
  // pybind native error
  if (!x) throw py::value_error();
  // From c10/Exception.h
  TORCH_CHECK(cond, "cond was false here");
  TORCH_WARN("Warning message");
  ...
  END_HANDLE_TH_ERRORS_PYBIND
}
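
For context, a hypothetical sketch (the module and registration names are illustrative, not part of this file) of how such a function might be exposed through pybind11:

// my_ext.cpp -- hypothetical pybind11 extension module
#include <torch/csrc/utils/pybind.h>

at::Tensor foo(at::Tensor x);  // the function defined above

PYBIND11_MODULE(my_ext, m) {
  // pybind translates its own exceptions (e.g. py::value_error);
  // END_HANDLE_TH_ERRORS_PYBIND inside foo handles PyTorch errors
  // and warnings.
  m.def("foo", &foo);
}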

GIL

Whenever you make any calls to the Python API, you must have taken out the Python GIL, as none of these calls are thread safe. pybind11::gil_scoped_acquire is a RAII struct which handles taking and releasing the GIL. Use it like this:

void iWantToUsePython() {
  pybind11::gil_scoped_acquire gil;
  ...
}

In general, the compiler will NOT warn you if you use Python functionality without taking out the GIL, so DO NOT FORGET this call.

utils/object_ptr.h

THPPointer is a smart pointer class analogous to std::shared_ptr, but which is overloaded to handle the reference counting schemes of various objects which are not based on shared_ptr. The most important overloads are:

  • PyObject (so important we've aliased it as THPObjectPtr), which hooks into Python reference counting. (By the way, that means you MUST take out the GIL before bringing one of these into scope! See the sketch after this list.)

  • The various TH tensor and storage types (e.g., THTensor), which hook into TH's reference counting. (TH's reference counting IS thread safe, no locks necessary.)
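
Putting the two gotchas together, a minimal sketch (untested; python_error and THPObjectPtr come from Exceptions.h and utils/object_ptr.h as described above):

#include <torch/csrc/Exceptions.h>
#include <torch/csrc/utils/object_ptr.h>
#include <pybind11/pybind11.h>

void printReprOfNone() {
  // Take out the GIL before calling the Python API or bringing a
  // THPObjectPtr into scope.
  pybind11::gil_scoped_acquire gil;
  // PyObject_Repr returns a new reference, which THPObjectPtr owns.
  THPObjectPtr repr(PyObject_Repr(Py_None));
  if (!repr) throw python_error();
  PyObject_Print(repr.get(), stdout, 0);
  // repr is decref'd automatically here, while the GIL is still held.
}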