Summary: Introduce a utility class AOTIModelRunner to take care of running an AOTInductor-compiled model. It handles tasks such as dlopen-ing the model .so, initializing the model container, setting up inputs and outputs, and destroying the model container.
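The real runner lives in the PyTorch C++ sources; below is a minimal, hypothetical sketch of the lifecycle such a runner manages (the class name, exported symbol names, and signatures are placeholders, not the actual AOTIModelRunner API):
```
#include <dlfcn.h>
#include <cstddef>
#include <stdexcept>
#include <string>

class ToyModelRunner {
 public:
  explicit ToyModelRunner(const std::string& so_path) {
    // dlopen the compiled model .so.
    handle_ = dlopen(so_path.c_str(), RTLD_NOW | RTLD_LOCAL);
    if (handle_ == nullptr) {
      throw std::runtime_error(dlerror());
    }
    // Symbol names are placeholders for the container entry points the
    // compiled model would export.
    create_ = reinterpret_cast<CreateFn>(dlsym(handle_, "ModelContainerCreate"));
    run_ = reinterpret_cast<RunFn>(dlsym(handle_, "ModelContainerRun"));
    destroy_ = reinterpret_cast<DestroyFn>(dlsym(handle_, "ModelContainerDelete"));
    if (create_ == nullptr || run_ == nullptr || destroy_ == nullptr) {
      throw std::runtime_error("missing model container symbols");
    }
    create_(&container_);  // initialize the model container
  }

  // Run one inference call with caller-prepared input/output buffers.
  void run(void** inputs, std::size_t num_inputs,
           void** outputs, std::size_t num_outputs) {
    run_(container_, inputs, num_inputs, outputs, num_outputs);
  }

  ~ToyModelRunner() {
    if (container_ != nullptr) destroy_(container_);  // destroy the container
    if (handle_ != nullptr) dlclose(handle_);          // unload the .so
  }

 private:
  using CreateFn = int (*)(void**);
  using RunFn = int (*)(void*, void**, std::size_t, void**, std::size_t);
  using DestroyFn = int (*)(void*);

  void* handle_ = nullptr;
  void* container_ = nullptr;
  CreateFn create_ = nullptr;
  RunFn run_ = nullptr;
  DestroyFn destroy_ = nullptr;
};
```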
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110891
Approved by: https://github.com/chenyang78
ghstack dependencies: #110652
Otherwise the following error is thrown when attempting to compile with WERROR enabled:
```
In file included from /home/nshulga/git/pytorch/pytorch/torch/csrc/distributed/c10d/socket.cpp:30:
/home/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:340:24: warning: redundant redeclaration of ‘constexpr’ static data member ‘fmt::v10::detail::codecvt_result<CodeUnit>::max_size’ [-Wdeprecated]
340 | constexpr const size_t codecvt_result<CodeUnit>::max_size;
| ^~~~~~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:335:33: note: previous declaration of ‘fmt::v10::detail::codecvt_result<CodeUnit>::max_size’
335 | static constexpr const size_t max_size = 32;
| ^~~~~~~~
```
or the following if using clang as the host compiler:
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/distributed/c10d/socket.cpp:30:
/Users/nshulga/git/pytorch/pytorch/third_party/fmt/include/fmt/chrono.h:340:50: warning: out-of-line definition of constexpr static data member is redundant in C++17 and is deprecated [-Wdeprecated]
constexpr const size_t codecvt_result<CodeUnit>::max_size;
```
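For context, here is a standalone illustration (my own example, not fmt's code) of the C++17 rule behind the warning: a static constexpr data member is implicitly inline, so the out-of-line definition that pre-C++17 code needed is now redundant and deprecated.
```
#include <cstddef>

template <typename CodeUnit>
struct codecvt_result_demo {
  static constexpr std::size_t max_size = 32;  // already a definition in C++17
  CodeUnit buf[max_size];
};

// Required before C++17 if max_size was odr-used; under -std=c++17 this line
// triggers the -Wdeprecated / "out-of-line definition is redundant" warning:
// template <typename CodeUnit>
// constexpr const std::size_t codecvt_result_demo<CodeUnit>::max_size;

int main() {
  codecvt_result_demo<char> r;
  return static_cast<int>(sizeof(r.buf)) -
         static_cast<int>(codecvt_result_demo<char>::max_size);
}
```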
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111002
Approved by: https://github.com/drisspg
Avoid changing the default for other backends, as the CPU backend (GLOO) may need
longer timeouts.
Motivated by trying to save cluster time when encountering collective
hangs. Generally, collectives should time out within seconds, so 30
minutes (or 10 minutes) should provide ample headroom for edge cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110947
Approved by: https://github.com/xw285cornell, https://github.com/fduwjj
This reverts commit ff0358b038.
(original PR https://github.com/pytorch/pytorch/pull/108815 desc copied below)
Expose a set of observability hooks into C10D so that our users can
detect collective failures both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.
This PR introduces a new module, torch.distributed.hooks, that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and is called inline from
the PG creation code. This is fine since it happens during initialization and only a limited number of times.
The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.
Queue notification is done using a pipe; this is needed so Python can abort the thread on shutdown
while keeping it a background thread, which is not possible with more conventional choices like a condvar.
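For illustration, here is a minimal sketch of a pipe-notified event queue (names and details are placeholders, not the actual implementation): the producer enqueues an event and writes one byte to the pipe, and the consumer polls on the read end, so the waiting thread can be woken for shutdown through the file descriptor instead of blocking on a condition variable that Python could not interrupt.
```
#include <poll.h>
#include <unistd.h>
#include <deque>
#include <mutex>
#include <optional>
#include <string>

class EventQueue {
 public:
  EventQueue() {
    if (pipe(fds_) != 0) { /* real code would handle the error */ }
  }
  ~EventQueue() {
    close(fds_[0]);
    close(fds_[1]);
  }

  void push(std::string event) {
    {
      std::lock_guard<std::mutex> guard(mu_);
      events_.push_back(std::move(event));
    }
    char byte = 0;
    (void)write(fds_[1], &byte, 1);  // wake up the consumer
  }

  // The consumer thread blocks in poll() on the read end of the pipe, so it
  // can also be woken by shutdown logic that touches the same fd.
  std::optional<std::string> pop() {
    pollfd pfd{fds_[0], POLLIN, 0};
    if (poll(&pfd, 1, /*timeout_ms=*/1000) <= 0) return std::nullopt;
    char byte;
    (void)read(fds_[0], &byte, 1);  // consume the notification
    std::lock_guard<std::mutex> guard(mu_);
    if (events_.empty()) return std::nullopt;
    std::string ev = std::move(events_.front());
    events_.pop_front();
    return ev;
  }

 private:
  int fds_[2]{-1, -1};  // fds_[0]: read end, fds_[1]: write end
  std::mutex mu_;
  std::deque<std::string> events_;
};
```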
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110907
Approved by: https://github.com/fduwjj
I tested by adding some warning logs in C++, running a distributed program, and checking that the messages now had `[rank0]:` in them. There is no existing test infra for C++ logging, so I couldn't easily add a unit test.
The implementation strategy is to set up a global variable in C++ and then poke it when we initialize a process group. This was the simplest thing I could think of that would work.
This PR only works for non-glog logging. We probably need to come up with some other strategy for glog, e.g., a custom prefix, but we need to make sure this doesn't conflict with fbcode. I can't easily test this from OSS; will leave as follow-up work.
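A minimal sketch of the strategy described (all names are hypothetical, not the actual PyTorch internals): a process-wide rank variable that the logging path consults when building its prefix, set once during process group initialization.
```
#include <atomic>
#include <iostream>
#include <string>

namespace logging_detail {
std::atomic<int> global_rank{-1};  // -1 means "rank unknown / not distributed"
}  // namespace logging_detail

// Called from process-group initialization to "poke" the global.
void set_global_rank(int rank) { logging_detail::global_rank = rank; }

// Used when composing the log message prefix.
std::string rank_prefix() {
  int r = logging_detail::global_rank.load();
  return r >= 0 ? "[rank" + std::to_string(r) + "]: " : "";
}

int main() {
  std::cerr << rank_prefix() << "before init\n";            // no prefix yet
  set_global_rank(0);                                        // done during PG init
  std::cerr << rank_prefix() << "collective timed out\n";    // "[rank0]: ..."
}
```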
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110623
Approved by: https://github.com/voznesenskym, https://github.com/wanchaol, https://github.com/fduwjj
Summary
Currently, when detecting a timeout/exception in the watchdog
workCleanupLoop, we call nccl APIs to abort all the active communicators
before finally re-raising the exception and killing the process. The
nccl APIs may hang, causing additional problems. Instead, just re-raise.
@kumpera proposed that changing this default should save us from a lot of commonly observed errors.
Note: there are other cuda/nccl API calls in our watchdog which could also hang. This change is not a substitute for a deeper refactor.
Detail
The current default (NCCL_ASYNC_ERROR_HANDLING=1:TearDown) meant the following:
SHOULD_TEAR_DOWN() evaluates to true
- This affects 'ProcessGroupNCCL::WorkNCCL::handleException`
- handleException is called from two places:
- work.wait() -> synchronizeInternal() -> handleException()
- workCleanupLoop() -> handleException()
- when true, the exception is logged and rethrown
SHOULD_CLEAN_UP() evaluates to true
- This only impacts the workCleanupLoop()
- When true, it means all communicators will be aborted (ncclCommAbort())
upon work exception or timeout
The proposed new default is NCCL_ASYNC_ERROR_HANDLING=3:SkipCleanUp.
This only changes SHOULD_CLEAN_UP() to false, impacting workCleanupLoop() behavior.
Communicators will no longer be aborted, which should avoid a class of bugs where the watchdog hangs due to calling nccl APIs which may block/hang.
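A simplified sketch of the control-flow difference (function and flag names are illustrative, not the actual ProcessGroupNCCL code): with SkipCleanUp, the cleanup loop skips the abort step, which itself can hang, and goes straight to rethrowing.
```
#include <exception>
#include <vector>

struct Comm {
  void abort() { /* may block inside NCCL */ }
};

// Old default (TearDown): true. New default (SkipCleanUp): false.
bool should_clean_up = false;

void handle_work_exception(std::exception_ptr eptr, std::vector<Comm>& comms) {
  if (should_clean_up) {
    // Old behavior: abort every active communicator first. Any of these
    // calls can hang, wedging the watchdog before it ever rethrows.
    for (auto& c : comms) {
      c.abort();
    }
  }
  // New behavior: just rethrow and let the process tear down.
  std::rethrow_exception(eptr);
}
```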
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110723
Approved by: https://github.com/fduwjj, https://github.com/xw285cornell
Fixes the string_view errors and relands the work. The previous changes in torch/csrc/utils/invalid_arguments.cpp were too aggressive and not tested thoroughly, so they are discarded.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110518
Approved by: https://github.com/ezyang
Summary: `torch.nonzero` doesn't have an inductor lowering (yet). To invoke the operator in AOT Inductor's ABI-compatibility mode, we need a dedicated shim function.
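To make "dedicated shim function" concrete, here is a hypothetical sketch of the shape such a C-ABI shim takes (the function name, handle type, and error-code convention are placeholders, not the actual shim header):
```
// Hypothetical sketch of an ABI-stable shim wrapping an aten fallback op.
// Everything crossing the boundary is an opaque handle or a plain C type.
using AtenTensorHandle = void*;

extern "C" int shim_nonzero_example(AtenTensorHandle self, AtenTensorHandle* out) {
  // Inside the shim the full C++ aten API is available: unwrap the handle,
  // call the operator, wrap the result into a new handle, and report
  // success/failure through the int return code instead of exceptions.
  // (Body omitted in this sketch.)
  (void)self;
  *out = nullptr;
  return 0;
}
```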
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_zero_grid_with_unbacked_symbols
...
----------------------------------------------------------------------
Ran 4 tests in 78.650s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110766
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713, #110745, #110764
Summary: For unbacked SymInts, the C++ wrapper codegen can generate expressions like `buf123.size()` or `.stride()` or `.storage_offset()`:
7cc0020a80/torch/_inductor/ir.py (L2504-L2520)
Here we add corresponding methods to the `RAIIAtenTensorHandle` class so that the codegen above works in ABI-compatibility mode.
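Roughly, the addition amounts to delegating accessors on the RAII wrapper that call through the C shim to query the underlying tensor. The getter names below are placeholders (stubbed for the sketch), not the real C-ABI functions:
```
#include <cstdint>

using AtenTensorHandle = void*;  // opaque handle used across the ABI boundary

// Placeholder C-ABI getters; the real ones live in the shim layer.
extern "C" std::int64_t shim_get_size(AtenTensorHandle, std::int64_t) { return 0; }
extern "C" std::int64_t shim_get_stride(AtenTensorHandle, std::int64_t) { return 0; }
extern "C" std::int64_t shim_get_storage_offset(AtenTensorHandle) { return 0; }

class RAIIAtenTensorHandleSketch {
 public:
  explicit RAIIAtenTensorHandleSketch(AtenTensorHandle h) : handle_(h) {}

  // Mirrors what the C++ wrapper codegen emits for unbacked SymInts,
  // e.g. `buf123.size(...)`, `.stride(...)`, `.storage_offset()`.
  std::int64_t size(std::int64_t dim) const { return shim_get_size(handle_, dim); }
  std::int64_t stride(std::int64_t dim) const { return shim_get_stride(handle_, dim); }
  std::int64_t storage_offset() const { return shim_get_storage_offset(handle_); }

 private:
  AtenTensorHandle handle_;
};
```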
Test Plan: CI + the following PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110764
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713, #110745
Summary: `repeat_interleave.Tensor` doesn't have an inductor lowering. To invoke the operator in AOT Inductor's ABI-compatibility mode, we need a dedicated shim function.
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_repeat_interleave
...
----------------------------------------------------------------------
Ran 4 tests in 70.526s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110745
Approved by: https://github.com/chenyang78
ghstack dependencies: #110713
Expose a set of observability hooks into C10D so that our users can
detect collective failures both faster and more easily.
The design is similar to NCCL desync debug in that it minimizes the
overhead by doing most of the work off the main thread.
This PR introduces a new module, torch.distributed.hooks, that exposes the following set of methods:
register_collective_start_hook
register_collective_end_hook
register_process_group_hook
The process group hook exposes PG creation on the member ranks and is called inline from
the PG creation code. This is fine since it happens during initialization and only a limited number of times.
The collective start/end hooks are fired from a single background thread, which reads
events from a C++ queue and dispatches them.
Queue notification is done using a pipe; this is needed so Python can abort the thread on shutdown
while keeping it a background thread, which is not possible with more conventional choices like a condvar.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108815
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Defining kernels as static vars is problematic for subsequent model loading on non-default CUDA devices.
If those kernels were first loaded in the context of device #0, they are no longer nullptr, so they are not reloaded and won't work on devices other than device #0.
This change makes the devices remembered at the model level in AOT mode.
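A simplified illustration of the failure mode and the fix (the kernel-loading call is a placeholder, not actual CUDA driver API usage):
```
#include <array>
#include <cstdint>

using KernelHandle = void*;

// Placeholder loader, stubbed for the sketch; the real code would load the
// kernel in the current device's context (e.g. via the CUDA driver API).
KernelHandle loadKernelForDevice(int device) {
  return reinterpret_cast<KernelHandle>(static_cast<std::uintptr_t>(device + 1));
}

// Problematic pattern: the static is populated once, in the context of the
// first device that runs the model (usually device #0), and then reused
// verbatim everywhere else because it is no longer nullptr.
KernelHandle get_kernel_static(int device) {
  static KernelHandle kernel = nullptr;
  if (kernel == nullptr) {
    kernel = loadKernelForDevice(device);
  }
  return kernel;  // stale handle when later called with a different device
}

// Model-level pattern: each model instance remembers its own device and keeps
// per-device handles, so loading on device #1 does not reuse device #0's.
struct ModelState {
  int device = 0;
  std::array<KernelHandle, 64> kernels{};  // indexed by device; sized generously

  KernelHandle get_kernel() {
    if (kernels[device] == nullptr) {
      kernels[device] = loadKernelForDevice(device);
    }
    return kernels[device];
  }
};
```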
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110554
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Fixes #106698
Also added a check to the Python API, because the current error message
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sehoon/pytorch-latest/torch/nn/modules/adaptive.py", line 128, in __init__
or (min(cutoffs) <= 0) \
ValueError: min() arg is an empty sequence
```
is not very comprehensible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106777
Approved by: https://github.com/albanD
We want to be able to use SingletonSymNode to represent strides for Jagged layout tensors. The following is for 3D, but easily generalizes to higher dimensions.
Constraints:
- [B, x, D] (where x represents the "variably lengthed dim") can be strided in two ways [x, 1, sum(x)] and [dx, d, 1]. We need two different placeholder values depending on how the jagged tensor is strided.
- When doing operations we need the strides of output tensors to be expressible in terms of the strides and sizes of the inner tensors. Given [B, x, D] @ [D, D'], the output strides are [x * D', D', 1] rather than some opaque [x2, D', 1]. This constraint exists because if I'm tracing, I need a symint to represent the output stride. This symint needs to come from somewhere; I can get it in several ways: (1) create a constant, (2) create an unbacked symint, (3) create a new input using a source, or (4) take the output of an operation on an existing symint. It is clear that (4) is what we want here, which brings us to the design below.
Design:
Given the two constraints, the most straightforward way to implement this is actually to update SingletonSymNode to include a scalar factor; morally, SingletonSymNode then represents `factor * [s_0, s_1, …, s_n]`. This enables us to symbolically compute strides from sizes.
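A toy illustration of the factor idea (not the actual SingletonSymNode implementation, which lives in PyTorch's symbolic shape machinery): multiplying the ragged symbol by a constant only scales its factor, so a stride such as x * D' stays a single symbolic value tied to the same underlying lengths.
```
#include <cassert>
#include <cstdint>

struct SingletonSym {
  std::int64_t id;      // identifies the ragged length sequence [s_0, ..., s_n]
  std::int64_t factor;  // the node stands for factor * [s_0, ..., s_n]
};

SingletonSym operator*(SingletonSym s, std::int64_t c) {
  return SingletonSym{s.id, s.factor * c};
}

int main() {
  SingletonSym x{/*id=*/0, /*factor=*/1};   // sizes of the jagged dim
  std::int64_t d_prime = 8;                 // output feature dim D'
  SingletonSym out_stride0 = x * d_prime;   // stride of dim 0 in [B, x, D']
  assert(out_stride0.id == x.id && out_stride0.factor == 8);
  return 0;
}
```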
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110369
Approved by: https://github.com/ezyang
ghstack dependencies: #110044
Summary:
This PR removes several APIs from the AOTInductor interface
that are not used by the client.
It also simplifies AOTInductor's model class by removing
the dim info for input/output tensors. We included dim info
before to return max output shapes, which the client used
to allocate memory for output tensors. Now we allocate output
tensor memory from the .so, so we don't need to maintain
such information anymore. The deletion of dim info from
the model class also simplifies the codegen quite a bit.
Test Plan: ci
Reviewed By: khabinov
Differential Revision: D49835430
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110411
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/jansel
Summary: Registering a param/buffer writes into a vector inside Object; we need to maintain thread safety if we have threads reading from and writing to the vector at the same time.
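A minimal sketch of the kind of guard this implies (class and member names are illustrative, not the actual Object implementation): registration and reads both take the same mutex.
```
#include <mutex>
#include <string>
#include <vector>

class ObjectSketch {
 public:
  void register_parameter(std::string name) {
    std::lock_guard<std::mutex> guard(mutex_);
    parameters_.push_back(std::move(name));
  }

  // Copy out under the lock so callers can iterate without holding it.
  std::vector<std::string> parameter_names() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return parameters_;
  }

 private:
  mutable std::mutex mutex_;
  std::vector<std::string> parameters_;
};
```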
Test Plan: CI
Differential Revision: D49882601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110488
Approved by: https://github.com/davidberard98
Summary: This diff refactors the code by moving CUDAAllocatorConfig into the header file. This config refactoring is done so that we can use the same config code for CUDA pinned memory as well.
Test Plan: sandcastle
Differential Revision: D49653265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110123
Approved by: https://github.com/zdevito