Summary:
In PyTorch, `tensor.to("cuda")` behaves differently from `tensor.to("cuda:0")`.
`tensor.to("cuda")` reads the thread-local DeviceGuard state, i.e. `cuda::current_device()`, to infer the device index.
TBEPermute relies on this behavior to route the output tensor to a device specified by the current thread.
For this reason, we remove normalizeDevice() and disallow index-less CUDA devices in Placement.
Device-to-device mapping must be done between concrete devices!
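A minimal sketch of the difference (assumes a libtorch build and a machine with at least two visible GPUs):
```
#include <c10/cuda/CUDAGuard.h>
#include <torch/torch.h>
#include <iostream>

int main() {
  auto x = torch::randn({4});
  // Thread-local device guard: makes cuda:1 the current CUDA device.
  c10::cuda::CUDAGuard guard(1);
  // Index-less device: the index is inferred from cuda::current_device(),
  // so this tensor lands on cuda:1.
  auto a = x.to(torch::Device(torch::kCUDA));
  // Concrete device: always lands on cuda:0, regardless of the guard.
  auto b = x.to(torch::Device(torch::kCUDA, 0));
  std::cout << a.device() << " " << b.device() << std::endl;  // cuda:1 cuda:0
  return 0;
}
```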
Test Plan:
CI
Rollback Plan:
Differential Revision: D78443109
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158489
Approved by: https://github.com/henryoier
Summary: Pretty simple. If a planner exists, which implies that planning is enabled, create a manager for each frame. The associated serial executor will use the withMemoryPlannner fn to ensure deallocation is done after execution completes.
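A rough sketch of the pattern (all names here are illustrative, not the actual nativert API): the per-frame manager keeps planned buffers alive for exactly the duration of the wrapped execution.
```
// Illustrative sketch only; the real nativert planner/executor types differ.
#include <functional>

class MemoryPlanner;  // produces a memory plan for a frame's intermediates

// Hypothetical per-frame manager: allocates the planned buffers up front and
// guarantees they are deallocated once the wrapped execution completes.
class PlannedMemoryManager {
 public:
  explicit PlannedMemoryManager(MemoryPlanner& planner) : planner_(planner) {}

  // Run `fn` with planned memory alive, releasing it afterwards (even on error).
  void withMemoryPlanner(const std::function<void()>& fn) {
    allocatePlannedBuffers();
    try {
      fn();
    } catch (...) {
      deallocatePlannedBuffers();
      throw;
    }
    deallocatePlannedBuffers();
  }

 private:
  void allocatePlannedBuffers() { /* reserve the planned arena */ }
  void deallocatePlannedBuffers() { /* return the arena to the allocator */ }
  MemoryPlanner& planner_;
};
```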
Test Plan: CI
Differential Revision: D73635809
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157053
Approved by: https://github.com/henryoier, https://github.com/georgiaphillips
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:
fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/cpp/nativert:c10_kernel_test
```
Differential Revision: D76825830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156208
Approved by: https://github.com/zhxchen17
Summary:
Moves OpKernel base class to PyTorch core. It is an abstract interface representing a kernel, which is responsible for executing a single Node in the graph.
Torch Native Runtime RFC: pytorch/rfcs#72
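A condensed sketch of the shape of that interface (names and signatures are illustrative; see the moved header for the real definition):
```
// Illustrative only; the actual OpKernel in torch/nativert may differ.
class Node;            // a single unit of execution in the graph
class ExecutionFrame;  // holds the runtime values for one execution

// Abstract interface for a kernel: something that knows how to execute
// exactly one Node of the graph against an execution frame.
class OpKernel {
 public:
  explicit OpKernel(const Node* node) : node_(node) {}
  virtual ~OpKernel() = default;

  // Read the node's inputs from the frame, run the op, write its outputs back.
  virtual void compute(ExecutionFrame& frame) const = 0;

  const Node* node() const { return node_; }

 protected:
  const Node* node_;
};
```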
Test Plan:
buck2 run mode/dev-nosan caffe2/test/cpp/nativert:op_kernel_test
Rollback Plan:
Differential Revision: D76525939
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156011
Approved by: https://github.com/zhxchen17
Extracting both the monitor thread and the watchdog thread into separate classes will help us learn what dependencies each thread has and will simplify the consolidation work for each thread (consolidating from one thread per PG instance to one per PG class).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155831
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
Summary:
Serialization contains utilities to deserialize a graph saved on disk in JSON format, as defined in `torch/csrc/utils/generated_serialization_types.h`, into the in-memory representation defined in `torch/nativert/graph/Graph.h`.
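Roughly, the entry point has the following shape (a sketch; the actual function name and signature in torch/nativert may differ):
```
// Illustrative sketch only.
#include <memory>

namespace torch::_export { struct Graph; }   // generated from the JSON schema
namespace torch::nativert {

class Graph;  // in-memory representation from torch/nativert/graph/Graph.h

// Walk the schema-generated graph, create a Value for every input, output,
// and intermediate, then create a Node per serialized node and wire its
// inputs to the already-created Values, preserving topological order.
std::unique_ptr<Graph> jsonToGraph(const torch::_export::Graph& serialized);

}  // namespace torch::nativert
```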
Test Plan:
buck2 run @mode/dev-nosan caffe2/test/cpp/nativert:serialization_test
Rollback Plan:
Differential Revision: D76012641
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155229
Approved by: https://github.com/zhxchen17
This is the start of a series of efforts to consolidate the auxiliary threads in PGNCCL, i.e., the watchdog and heartbeat_monitoring threads. Right now we launch these two threads per PG instance, so if users create hundreds or thousands of PG or subPG instances, we end up with twice as many side threads, which is not efficient. We have an RFC to consolidate them (https://github.com/pytorch/pytorch/issues/146956). Both threads currently carry so many responsibilities that it is hard to do the consolidation in one shot, so we will split it into at least two steps (PRs) to make it easier to test and review.
We made a first attempt in https://github.com/pytorch/pytorch/pull/153668, but we also want to see whether we can make the monitoring thread a class. This PR takes the first step and turns the monitoring thread into a class. The next step is to also extract the watchdog into a separate class so that we know its dependencies.
What we did in this PR:
1. Move all related variables and methods into a class named `HeartbeatMonitor`.
2. Correct some errors in the original logic inside the monitoring thread loop.
3. Move the error propagation check to the watchdog thread, where it is more relevant. This is fine since we have fully rolled out EventCache, so watchdog hangs are rare now.
Today the heartbeat monitoring thread has two major functions (a simplified sketch follows this list):
1. Check the heartbeat of the watchdog thread every 8 minutes. If no heartbeat is detected and we are sure the monitoring thread has not been stopped, we kill the program via SIGABRT.
2. Check TCPStore every 30 seconds to see whether a watchdog timeout has happened on any other rank; if so, we initiate a dump signal on the current rank as well. (We do this only in the default PG.)
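A simplified sketch of that loop, using the intervals described above (names are illustrative, not the exact ProcessGroupNCCL code):
```
// Illustrative sketch of the monitoring loop; the real HeartbeatMonitor differs.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <cstdlib>
#include <mutex>

class HeartbeatMonitor {
 public:
  void runLoop() {
    using namespace std::chrono_literals;
    std::unique_lock<std::mutex> lock(mutex_);
    uint64_t lastHeartbeat = 0;
    auto lastHeartbeatCheck = std::chrono::steady_clock::now();
    while (!stop_) {
      // Wake up roughly every 30 seconds (or earlier if asked to stop).
      cv_.wait_for(lock, 30s, [&] { return stop_.load(); });
      if (stop_) {
        break;
      }
      // (2) Every 30s: if another rank reported a watchdog timeout via
      // TCPStore, trigger a debug dump on this rank too (default PG only).
      if (isDefaultPG_ && checkStoreForTimeoutSignal()) {
        triggerDebugDump();
      }
      // (1) Every 8 minutes: verify the watchdog thread has bumped its
      // heartbeat counter since the last check; if not, abort the process.
      auto now = std::chrono::steady_clock::now();
      if (now - lastHeartbeatCheck >= 8min) {
        lastHeartbeatCheck = now;
        const uint64_t current = heartbeat_.load();
        if (current == lastHeartbeat) {
          std::abort();  // watchdog appears to be hung
        }
        lastHeartbeat = current;
      }
    }
  }

 private:
  bool checkStoreForTimeoutSignal() { return false; }  // poll a TCPStore key
  void triggerDebugDump() {}  // write a flight-recorder style debug dump

  std::atomic<bool> stop_{false};
  std::atomic<uint64_t> heartbeat_{0};  // incremented by the watchdog thread
  bool isDefaultPG_{true};
  std::mutex mutex_;
  std::condition_variable cv_;
};
```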
Differential Revision: [D75799278](https://our.internmc.facebook.com/intern/diff/D75799278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153977
Approved by: https://github.com/kwen2501, https://github.com/d4l3k
We split the large PR for adding Graph.h and Graph.cpp to nativert into 3 smaller PRs:
1. Add header file
2. Add source file
3. **Add test and build rules**
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
4 classes have been introduced: `Graph`, `Node`, `Value`, `Type` (see the sketch after this list)
- `Type` represents the kind of a `Value`
- `Value` represents a single symbolic value; it can be of any kind that exists in `Type`. Values are the inputs and outputs of a `Node`.
- `Node` represents a single unit of execution, typically a PyTorch op.
- `Graph` represents a model's computation graph, which is designed to facilitate transformation/analysis.
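A condensed sketch of how the four classes relate (member names are illustrative; the real API is in `torch/nativert/graph/Graph.h`):
```
// Illustrative skeleton only; see torch/nativert/graph/Graph.h for the real API.
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class Type {  // the kind of a Value
  Tensor,
  TensorList,
  SymInt,
  None,
};

class Node;

class Value {  // a single symbolic value; values are inputs/outputs of Nodes
 public:
  Value(std::string name, Type type) : name_(std::move(name)), type_(type) {}
  Type type() const { return type_; }

 private:
  std::string name_;
  Type type_;
  Node* producer_ = nullptr;  // node that defines this value, if any
  std::vector<Node*> users_;  // nodes that consume this value
};

class Node {  // a single unit of execution, typically a PyTorch op
 public:
  explicit Node(std::string target) : target_(std::move(target)) {}

 private:
  std::string target_;  // e.g. "torch.ops.aten.add.Tensor"
  std::vector<Value*> inputs_;
  std::vector<Value*> outputs_;
};

class Graph {  // the whole computation graph; owns its values and nodes
 private:
  std::vector<std::unique_ptr<Value>> values_;
  std::vector<std::unique_ptr<Node>> nodes_;  // kept in topological order
  std::vector<Value*> inputs_;
  std::vector<Value*> outputs_;
};
```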
Differential Revision: [D75495273](https://our.internmc.facebook.com/intern/diff/D75495273/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154532
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #154530, #154531
Summary: The goal of this PR and future follow-up PRs is to group a set of header files required by AOTInductor Standalone in a separate directory, ensuring they are implemented in a header-only manner.
Test Plan: CI
Differential Revision: D75756619
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154850
Approved by: https://github.com/janeyx99
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
Added an in-memory representation for the input and output specs of a graph. The GraphSignature class models the input and output specs of an exported graph produced by torch.export, and it holds the graph information deserialized from the pt2 archive package. The runtime relies on GraphSignature for weight name lookup and weight loading.
The serialization schema is defined in torch/_export/serde/schema.py
See more at: https://docs.pytorch.org/docs/stable/export.html#torch.export.ExportGraphSignature
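A condensed sketch of the kind of information GraphSignature carries (field names are illustrative and mirror the Python ExportGraphSignature concepts, not the exact C++ API):
```
// Illustrative only; the real GraphSignature lives under torch/nativert/graph/.
#include <string>
#include <unordered_map>
#include <vector>

struct GraphSignature {
  // Graph inputs that correspond to parameters/buffers, mapped to the weight
  // FQN they should be loaded from, e.g. "p_linear_weight" -> "linear.weight".
  std::unordered_map<std::string, std::string> inputsToParameters;
  std::unordered_map<std::string, std::string> inputsToBuffers;

  // Plain user inputs and user outputs, in positional order.
  std::vector<std::string> userInputs;
  std::vector<std::string> userOutputs;

  // Outputs that write back into buffers, mapped to the mutated buffer's FQN.
  std::unordered_map<std::string, std::string> buffersToMutate;
};
```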
Test Plan: Added tests under `test/cpp/nativert/test_graph_signature.cpp`
Differential Revision: D73895378
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152969
Approved by: https://github.com/swolchok
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
This diff moves `TensorMeta.cpp` and `TensorMeta.h` to PyTorch core under `torch/nativert/graph/`
The existing `torch::_export::TensorMeta` in `torch/csrc/utils/generated_serialization_types.h` is auto-generated from the export serde schema and therefore only contains the most basic serializable types. We need the newly added `TensorMeta.cpp` to deserialize the metadata into an in-memory class with c10 types so that it can be consumed by the runtime later.
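Roughly, the runtime-side class holds c10 types instead of the serialized enums/strings (a sketch; the actual fields in `TensorMeta.h` may differ):
```
// Illustrative sketch; the actual torch::nativert TensorMeta may differ.
#include <c10/core/Device.h>
#include <c10/core/ScalarType.h>
#include <cstdint>
#include <vector>

struct TensorMeta {
  c10::ScalarType dtype{c10::ScalarType::Float};  // from the serialized dtype enum
  c10::Device device{c10::DeviceType::CPU};       // from the serialized device string
  std::vector<int64_t> sizes;                     // concrete or symbolic dims
  std::vector<int64_t> strides;
  bool requiresGrad{false};
};
```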
Test Plan:
Added test under `test/cpp/nativert/test_tensor_meta.cpp`
Differential Revision: D73820548
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152475
Approved by: https://github.com/albanD
This is my suggestion for resolving #152087
This PR extends the constructor of `AOTIModelPackageLoader` with an (optional) device index. The device type is still determined by `metadata_["AOTI_DEVICE_KEY"]`, but the `device_index` argument can be used to move an AOTI model package to a specific device like `cuda:0`, `cuda:1`, ... in a convenient way. AFAIK, this is not possible so far using `AOTIModelPackageLoader` alone. The default case (no device index specified) with `metadata_["AOTI_DEVICE_KEY"] == "cuda"` preserves the current behavior, i.e., the model is loaded onto device `cuda`.
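A hedged usage sketch (the exact position and name of the new constructor argument are assumptions; see the header for the real signature):
```
// Illustrative usage; see torch/csrc/inductor/aoti_package/model_package_loader.h
// for the exact constructor signature after this change.
#include <torch/csrc/inductor/aoti_package/model_package_loader.h>
#include <torch/torch.h>
#include <vector>

int main() {
  // Default: the target device comes from metadata_["AOTI_DEVICE_KEY"] (e.g. "cuda").
  torch::inductor::AOTIModelPackageLoader loader("model.pt2");

  // With this PR an optional device index can also be passed, so the same
  // package can be placed on e.g. cuda:1 (argument name/position assumed):
  //   torch::inductor::AOTIModelPackageLoader loader1("model.pt2", ..., /*device_index=*/1);

  std::vector<torch::Tensor> inputs{torch::randn({2, 3}, torch::kCUDA)};
  std::vector<torch::Tensor> outputs = loader.run(inputs);
  (void)outputs;
  return 0;
}
```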
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152093
Approved by: https://github.com/desertfire