Summary: In this PR, `torch.cond` support and the necessary codegening infrastructure is added to C++ wrapper (AOTInductor and friends).
Notable additions:
- A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config).
- Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers.
- More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions).
- New unit tests to cover the added AOT Inductor + `torch.cond` functionality.
Codegen examples from the new unit tests:
- [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e)
- [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb)
- [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6)
- [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223)
- [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711)
- [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12)
- [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690)
Test Plan:
```
$ python test/inductor/test_aot_inductor.py -k test_cond
...
----------------------------------------------------------------------
Ran 42 tests in 170.619s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120
Approved by: https://github.com/jansel, https://github.com/chenyang78
This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076
Approved by: https://github.com/malfet
# Update Profiler API to collect Execution Traces
## TLDR
We would like to simplify collecting Execution Trace and Kineto together. Execution Trace and Kineto both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware.
```
import torch
def main():
with torch.profiler.profile(
activities=[
torch.profiler.ProfilerActivity.CPU,
torch.profiler.ProfilerActivity.CUDA,
],
…
excution_trace_observer=ExecutionTraceObserver() # <<<<<<< NEW
) as prof:
...
prof.step()
```
See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API.
## What are Execution Traces?
[Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph based representation of AI/ML workloads. It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies.
- Chakra is part of ML Commons industry standard and is being adopted by other companies besides NVIDIA too.
- At Meta we have instrumented PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki)
Chakra essentially enables benchmarking and co-design for ML Models without having to reproduce entier software stacks and helps companies collaborate together [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)]
## Why correlate Execution Trace with PyTorch/Kineto Trace
Both Execution Traces and Kineto/ provide different types of information and combining. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. In addition, collecting them at different timestamps will be inaccurate as several operations (NCCL, Embedding lookup) are data dependent and may not match correctly.
Thus, it makes sense to collect both ET and Kineto together. The problem is that there are two code paths.
## Proposal
The proposal is to modify the PyTorch profiler (Kineto) API to enable execution trace to be collected simultaneously, see TLDR section
# Testing
Updated the unit test for collecting kineto and Execution Trace together.
- Check the collected ET has right range of events.
- Compare two sets of IDs - record func Ids in ET and external IDs in Kineto. We check if these have a constant difference.
```
pytest test/profiler/test_profiler.py -k test_execution_trace_with_kineto -rP
Running 1 items in this shard
test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[W execution_trace_observer.cpp:694] Disabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
- Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm
- Include files more granularly to avoid namespace pollution and circular imports
limitations:
- requires user to audit their code and opt-in their custom autograd::Function via autograd::Function::is_traceable and maybe additional compiled_args + apply_with_saved implementation. this was the only way I can think of for soundness
- will throw if we can't hash the saved_data i.e. for any non implemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function, yet that has an identical implementation, is called. this case seems extremely unlikely, and the only alternative to hash collision i can think of is compiling with reflection
- tensors not saved via save_variables are not lifted, and are specialized on TensorImpl*'s hash (treated as a memory address). if needed, we can lift them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
RECORD_FUNCTION in python_function only captures argument that is a Tensor. However, it is very common for user to use non tensor arguments in custom ops, for example, sequence length in GPT attention custom op. My previous PR tries to capture all non-tensor arguments, it turned out in some cases, it is very expensive.
This PR is to support primitive (or its container) arguments in RECORD_FUNCTION.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120949
Approved by: https://github.com/soulitzer
**Summary:**
This commit simplifies the existing decomposition hierarchy
of batch norm ops by adding a single, backend agnostic op:
`batch_norm_with_update`. The existing hierarchy looks like:
```
aten.batch_norm ->
aten._batch_norm_impl_index ->
[
aten.native_batch_norm ->
aten._native_batch_norm_legit (export only) ->
_batch_norm_legit_cpu/cuda (kernels, export only) ->
_batch_norm_cpu/cuda (kernels)
] OR
[ aten.cudnn_batch_norm ] OR
[ aten.miopen_batch_norm ]
```
Aside from complexity, an important problem with the
above decomposition hierarchy is cuda numerics in
export flows. We observed significantly worse convergence
when training a mobilenetv2-like model when using the
`_batch_norm_cuda` kernel instead of the `cudnn_batch_norm`
kernel. This means users who export their models on CPU
first then move the models to cuda later may silently
see worse accuracies even when cudnn is installed,
because they are using the worse kernel. This issue is
summarized in https://github.com/pytorch/pytorch/issues/111384.
Instead, the new hierarchy proposed by consolidating
existing batch norm ops will look like:
```
aten.batch_norm ->
aten.batch_norm_with_update ->
[ _batch_norm_cpu (kernel) ] OR
[ _batch_norm_cuda (kernel) ] OR
[ cudnn_batch_norm (kernel) ] OR
[ miopen_batch_norm (kernel) ]
```
The new op `batch_norm_with_update` hides backend
implementation details and automatically picks the right
kernel based on what is installed. This commit also adds
the following variants to this op:
```
batch_norm_with_update_functional
batch_norm_with_update.out
batch_norm_no_update
batch_norm_no_update.out
batch_norm_backward
```
Note that this commit only adds this op and its variants,
but does not actually change the decomps to produce these
ops in the graph. This will be done after the 2 week FC
window, and the ops used in the old stack is planned to
be removed after the 6 month BC window.
Test Plan: `OpInfo` tests for `batch_norm_with_update`.
Reviewers: albanD, bdhirsh
Subscribers: albanD, bdhirsh, supriyar
Tasks: https://github.com/pytorch/pytorch/issues/111384
Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092
Approved by: https://github.com/bdhirsh, https://github.com/albanD
The main thread and the autograd one are latency critical threads. They launch CPU/GPU/Accelerator kernels and if for some reason they get preempted, the rank can become a straggler in a distributed training application. By naming these threads we can debug performance issues that impact the latency sensitive threads.
I used Kineto traces to verify if the thread names were propagated:
<img width="851" alt="Screenshot 2024-03-04 at 3 07 43 PM" src="https://github.com/pytorch/pytorch/assets/23515689/68b4a09c-b8e5-4f14-a5c0-6593f866c03f">
Also:
```
nvidia-smi
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 3065920 C ...me#python#py_version_3_10 1968MiB |
| 1 N/A N/A 3065926 C ...me#python#py_version_3_10 1978MiB |
| 2 N/A N/A 3065930 C ...me#python#py_version_3_10 2084MiB |
| 3 N/A N/A 3065936 C ...me#python#py_version_3_10 2016MiB |
| 4 N/A N/A 3065939 C ...me#python#py_version_3_10 1998MiB |
| 5 N/A N/A 3065943 C ...me#python#py_version_3_10 2070MiB |
| 6 N/A N/A 3065948 C ...me#python#py_version_3_10 2026MiB |
| 7 N/A N/A 3065952 C ...me#python#py_version_3_10 2070MiB |
+-----------------------------------------------------------------------------+
[me@myhost ~]$ ps -T -p 3065920
PID SPID TTY TIME CMD
3065920 3065920 pts/14 00:01:04 pt_main_thread
...
3065920 3092181 pts/14 00:00:40 pt_autograd_d0
3065920 3092182 pts/14 00:00:00 pt_autograd_d1
3065920 3092183 pts/14 00:00:00 pt_autograd_d2
3065920 3092184 pts/14 00:00:00 pt_autograd_d3
3065920 3092185 pts/14 00:00:00 pt_autograd_d4
3065920 3092186 pts/14 00:00:00 pt_autograd_d5
3065920 3092187 pts/14 00:00:00 pt_autograd_d6
3065920 3092188 pts/14 00:00:00 pt_autograd_d7
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121170
Approved by: https://github.com/albanD
Closes #120988
Currently operators that hit the autograd fallback call `check_inplace`
on all mutated inputs, including out arguments. This leads to a slightly
confusing error message:
```
RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.
```
Compared to functions that don't fallback, which raise
```
RuntimeError: add(): functions with out=... arguments don't support automatic differentiation, but one of the arguments requires grad.
```
This changes the error message to make clear the issue is with the out argument,
but does not tighten the check to outright ban out arguments that require grad.
Instead, I use the same checks from `check_inplace` which allows non-leaf tensors
that require grad to pass without error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121089
Approved by: https://github.com/lezcano, https://github.com/soulitzer
ghstack dependencies: #121142
Summary: The current C shim layer manually implements a C interface for a handful of ops. Obviously that's not scalable if we want to extend it to cover all aten ops. This new torchgen script automatically generates C shim interfaces for CPU and CUDA backends. The interface follows the same parameter passing rules as the current C shim layer, such as
* Use plain C data types to pass parameters
* Use AtenTensorHandle to pass at::Tensor
* Use pointer type to pass optional parameter
* Use pointer+length to pass list
* Use device_type+device_index to pass device
* When a parameter is a pointer of pointer, e.g. AtenTensorHandle**, the script generates either a list of optional values or an optional list of values
https://gist.github.com/desertfire/83701532b126c6d34dae6ba68a1b074a is an example of the generated torch/csrc/inductor/aoti_torch/generated/c_shim_cuda.cpp file. The current version doesn't generate C shim wrappers for all aten ops, and probably generates more wrappers than needed on the other hand, but it should serve as a good basis.
This PR by itself won't change AOTI codegen and thus won't introduce any FC breakage. The actual wrapper codegen changes will come in another PR with some version control flag to avoid FC breakage.
Differential Revision: [D54258087](https://our.internmc.facebook.com/intern/diff/D54258087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120513
Approved by: https://github.com/jansel
Summary:
When there are multiple PGs in a process and a hardware failure happens,
we found that multiple PGs/ threads in the same
process are competing to dump the same records at the same time. The
affects the reliability of dumps.
In this PR, we will try to make the change such that only one thread/PG
could dump: PG0's monitor thread. We use a static variable to indicate
that something (e.g., collective timeout) has triggered the dump
locally.
monitor thread would dump debug info under any one of the 3 conditions:
1: this static variable is set to true by the watchdog thread when it detects
a timeout or pipe dump signal
2: timeout signal is received from other ranks through tcpstore
3: no heartbeat of watchdog
Test Plan:
python test/distributed/test_c10d_nccl.py -k
test_timeout_dumps_on_stuck_ranks
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893
Approved by: https://github.com/wconstab
In particular this ensures we release the GIL when serializing:
- PyBytes objects (this is how we get the pickle object)
- Storage objects
Other string-like objects keep the gil which is fine because we only use this for very small strings today (for endianess) and so releasing the GIL is not important there
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120818
Approved by: https://github.com/colesbury
VariableInfo is used by both `custom_function.h` (in a templated class) and `compiled_autograd.h` (in a class with some templated methods). Another way could have been to make a `compiled_autograd.cpp` and forward declare VariableInfo, but this VariableInfo was also being used in other nodes like PyNode so it felt cleaner to do it this way.
Differential Revision: [D54287007](https://our.internmc.facebook.com/intern/diff/D54287007)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120732
Approved by: https://github.com/jansel
Summary:
https://github.com/pytorch/pytorch/pull/104373 introduced backend_id
> an unique ID for the actual backend object, this is also exposed in record_param_comms, so we can correlate these collectives with the right backend object.
However, it is inconvenient to correlate collectives with backend id. Instead, using pg id(uid) to correlate directly is a better solution.
This PR change the ID information exposted in record_param_comms from backend_id to pg_id.
Differential Revision: D53558257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120475
Approved by: https://github.com/aaronenyeshi
Fix and test issues with both coalesced and individual send/recv ops
Considered an alternate approach and then ditched it
- alternate approach: #119757
- reason ditched: prefer recording individual collective events inside
coalescing region instead of just the event at the end of the region,
which also would not have tensor sizes or opnames without additional
state variables added
Another approach also ditched
- record events on workEnqueue instead of initWork
- reason ditched: too messy to get input/output shapes tagged on
recording when recording in workEnqueue. Adding the info onto the
Work obj would be possible, but adds to overhead of copying Works
which we do on every collective. We can get info off the input/output
tensors directly in initWork, but we don't want to keep refs to those
tensors alive while the work is Enqueued, so we'd have to specifically
copy size lists or something.
This PR instead avoids creating a work inside pointToPoint when
coalescing is active. Instead, only at endCoalescing() is a work finally
intialized and enqueued. But it adds a record() call inside
pointToPoint() instead of creating a work, during coalescing. This
record() call picks up tensor shapes and op names.
It ALSO changes initWork to accept a 'record' argument. This defaults to
false, and should only be set to true if the caller ensures the work
will be enqueued by workEnqueue, ensuring its cuda events are live when
used by flight recorder's update_state().
The testing uncovers some odd pre-existing behavior and leaves them
alone for now. We could change some of these
- seq starts off at 1, not 0 for first op (but this is inconistent)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120270
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #120724
In cases where sequence number is shared between events (e.g. coalesced
collectives) we want to ensure a unique (and ordered) ID per record.
Note: the records are already in a list, so their ID could be implicitly
observed. But (1) it's a ring buffer, so absolute ID is lost once the
buffer rolls over once, (2) users may sort or process or filter their
flight records, so having the ID be an explicit member of an entry is
still useful
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120724
Approved by: https://github.com/zdevito
This adds support for backwards hooks that are *both*:
1) Interior to the graph; and
2) Dynamically generated (e.g. lambdas)
We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo *after* the forwards runs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382
Approved by: https://github.com/xmfan
This pull request is writing to provide an update on the recent advancements made in the PyTorch profiler with regards to XPU backend support. Following the successful merge of a previous pull request #94502 that established a pathway for the XPU backend within PyTorch, we have now taken steps to enhance the profiler's capabilities for handling and displaying profile data directly related to the XPU backend.
# Motivation
The current pull request builds upon this foundation by refining the profiler's data processing scripts, particularly `profiler_util.py`, to accommodate XPU backend-specific profile data. The aim is to align the handling and presentation of this data with that of the CUDA backend, offering users a consistent experience across different device profiles. This includes generating outputs such as JSON files compatible with Chrome trace tooling, among other formats.
# Principles
1. Minimal Impact: The modifications introduced should support XPU backend data with minimal disruption to the existing profiling scripts.
2. Consistency: Changes should maintain stylistic and functional consistency with existing `CUDA` and `privateuse1` pathways, ensuring no adverse effects on other logic paths.
3. Exclusivity: Ensure that the new XPU pathway does not interfere with or impede other pathways.
# Solutions
### a. Pathway Identification:
Introduction of a `use_xpu` flag within `torch.autograd.profiler.profile` interfaces to distinguish XPU-specific profiling.
### b. `use_device` Logic Revision:
With the introduction of the XPU pathway, `use_device` no longer implies a binary relationship with `use_cuda`. Consequently, we have revised related logic to remove implicit assertions and establish independent device distinction.
### c. Kernel List Segregation:
To accommodate the non-binary nature of device pathways, we have enabled kernel lists to identify specific device affiliations through separate list objects.
### d. Formatted Output:
To ensure output consistency, we have employed code duplication and keyword substitution techniques to facilitate the formatting of XPU-related profile data.
# Additional Enhancements
### a. Enumerations in `.pyi` Files:
Added recognition items for `DeviceType` and `ProfilerActivity` specific to XPU.
### b. Correct DeviceType Returns:
Revised `deviceTypeFromActivity` logic to accurately differentiate between device backends, even when they share common flags such as `libkineto::ActivityType::GPU_MEMCPY`.
### c. Bug Fixes in `cuda_corr_map`:
Addressed a corner case where erroneous parent-child event relationships were formed due to shared function event identifiers. The solution involves refining `cuda_corr_map` processing to prevent a function event from being misidentified as both the linker and linkee.
# Further Abstraction
Looking forward, we acknowledge the potential for further abstraction in the codebase. The current changes necessitated by XPU support have highlighted opportunities for reducing redundancy by consolidating naming conventions and utilizing a singular `device` naming system that relies on `DeviceType` attributes or string flags for differentiation. This would involve significant refactoring to replace device-specific flags and variables. This topic needs further discussions about whether we could and when we should deprecate all those flags and variables named with `cuda`.
# Next Pull Request
The next pull request will be contingent on Kineto's adoption of Intel's forthcoming PTI-sdk library, which will enable direct usage of XPU-related tracers. Subsequent modifications to `libkineto_init()` will aim to endow PyTorch running on XPU backends with comprehensive profiling capabilities on XPU devices.
We appreciate your attention to these enhancements and welcome any feedback or questions you may have regarding these developments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120185
Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui
Summary: Expose sequence number to work info. The number can help applications identify a NCCL work more precisely.
Test Plan:
1. pytest test/distributed/test_c10d_nccl.py::WorkHookTest::test_on_completion_hook_seq
2. pytest test/distributed/test_c10d_nccl.py::WorkHookTest
Differential Revision: D54180050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120596
Approved by: https://github.com/kwen2501
Fixes#115331.
This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:
- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`.
- Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`
[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/huydhn
1) Using items stored in torch._tensor_classes to check item passed from python side;
2) Add SparsePrivateUse1 in backend_to_string, layout_from_backend and check_base_legacy_new;
3) Using more general API to get python module name in get_storage_obj and get_name functions.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119263
Approved by: https://github.com/ezyang
Summary:
The current dump timeout logic is a bit cumbersome as it needs 2 times: 1.
timeout, 2. wake up time. And in theory the caller just needs to wait
for a max of timeout value for the dump and declare the dump to be
either successful or not. Also we unify the async call using std::async
instead of a customized async lauch function for each operation.
Test Plan:
Unit tests
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120331
Approved by: https://github.com/wconstab
Summary:
Previously, I added lastEnqueuedSeq_ and lastCompletedSeq_ to store the states of PG progress
but log only when there is timeout detected.
We found it is not enough since the 'straggler' itself might not detect
the timeout and hence there is no log from the 'straggler'.
In this PR, we can log these states periorically so that it would be
much easier for us to identify the straggler by checking which rank
has the smallest number of lastEnqueuedSeq_
Test Plan:
Log adding, build success
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120438
Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/kwen2501
Summary:
While I think it probably makes more sense to only require `all_reduce` input to be non-overlapping and dense, today `ProcessGroupNCCL` requires it to be contiguous. This is also what the `all_reduce` in non-native funcol does.
Also marking a test affected by this with `@run_with_both_funcol_impls`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120042
Approved by: https://github.com/wanchaol
Fixes#112389
## About
PyTorch (Kineto) profiler registers with the profiling daemon Dynolog to enable on-demand profiling. The user should only need to set the env variable `KINETO_USE_DAEMON`. To enable this we need to initialize kineto library early rather than lazily on a PyTorch profiler call. This initialization happens in a static initializer.
- Kineto init function basically registers a callback using the CUDA CUPTI library https://github.com/pytorch/kineto/blob/main/libkineto/src/init.cpp#L130-L148
- However, the above needs the dynamic linking to libcupti.so to have taken place.
- I understand now that static initializations of compilation units will be called before the dynamic linking leading to a segfault in #112389

## Workaround
We add a delay in the initialization that can be configured using the env variable 'KINETO_DAEMON_INIT_DELAY_S'. May not be the best but it could help resolve the issue.
## Testing
Tested this out with [linear_model_example.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py)
First export the daemon env variable
### Without any delay
```
>$ python3 linear_model_example.py
INFO:2024-02-21 19:34:50 2366287:2366287 init.cpp:131] Registering daemon config loader, cpuOnly = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclientb8f91363-d8d6-47a7-9103-197661e28397 status = initialized
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
99 1385.468505859375
```
### With 5 seconds delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=5 python3 linear_model_example.py
cpu
99 284.82305908203125
10099 8.817167282104492
INFO:2024-02-21 19:34:26 2359155:2359214 init.cpp:131] Registering daemon config loader, cpuOnly = 1
ERROR: External init callback must run in same thread as registerClient (1782580992 != -1922169024)
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient49270a3f-e913-4ea6-b9e0-cc90a853a869 status = initialized
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
20099 8.817167282104492
```
### With an invalid delay
```
>$ KINETO_DAEMON_INIT_DELAY_S=abc python3 linear_model_example.py
INFO:2024-02-21 19:35:02 2369647:2369647 init.cpp:131] Registering daemon config loader, cpuOnly = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0e12a349-af7b-4322-901d-1ff22f91fd4c status = initialized
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
cpu
```
### Unit test updated as well.
## Impact
This should not impact any general user. The initialization only occurs if `KINETO_USE_DAEMON` is set in the environment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120276
Approved by: https://github.com/anupambhatnagar, https://github.com/aaronenyeshi
Summary: RECORD_FUNCTION only capture the argument when it is a Tensor. However, it is very common for user to use the argument with primitive data type (int, float, index, bool). This DIFF is to support non tensor arguments in RECORD_FUNCTION.
Test Plan:
unit test
buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 test_execution_trace_alone test_execution_trace_with_kineto test_execution_trace_start_stop test_execution_trace_repeat_in_loop test_execution_trace_no_capture
Differential Revision: D53674768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120017
Approved by: https://github.com/soulitzer
Fixes#115331.
This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary:
- `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`.
- Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`.
- Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this.
- Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS`
[^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639
Approved by: https://github.com/cyyever, https://github.com/albanD
Summary:
This PR is mainly about flight recorder side of changes that takes a
map of maps as input, and dump it as picklable. Also add functions that
should be compiled only when NCCL_COMM_DUMP is defined
Test Plan:
Integration tests with NCCL would be done later, here we only do the
c10d side of dump test, aka,NCCLTraceTest
Testing the dump function is a bit tricky as we don't have
existing C++ unit tests for them. So we still use the Python NCCLTraceTest with
the python binding of _dump_nccl_trace(), we manually fed the
dump_nccl_trace with a map of test info, and assert the pickle result and
print the converted python dict:
```
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$ python
test/distributed/test_c10d_nccl.py NCCLTraceTest
NCCL version 2.19.3+cuda12.0
[rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL
preparing to dump debug info.
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
{'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2':
'Value2', 'Key1': 'Value1'}}
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.NCCL version 2.19.3+cuda12.0
.
----------------------------------------------------------------------
Ran 8 tests in 95.761s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063
Approved by: https://github.com/wconstab
Additional changes: tests in test_functional_api.py uses multi-threaded pg which is implemented in Python. For the native ops to call into the Python pg implementation, glue code in PyProcessGroup is required for each collective. This PR also adds a few pieces of previously missing glue code, which are necessary for running test_functional_api.py with native funcol.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119982
Approved by: https://github.com/wanchaol
### Summary
@LucasLLC recently implemented `broadcast` in funcol. This is not yet available in the native funcol ops. This PR adds support for broadcast for native funcol.
- Added `_c10d_functional::broadcast` and `_c10d_functional::broadcast_`
- Integrated with python functol broadcast and `AsyncCollectiveTensor`
- Implemented Inductor lowering. Verified correctness and buffer reuse behavior
- Validated dynamo traceability
- Validated AOTInductor compile-ability
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119229
Approved by: https://github.com/wanchaol
ghstack dependencies: #119104
Summary: If cuda is not initialized before calling attachAllocatorTraceTracker, then the CudaCachingAllocator device_allocator is empty which means that the registration hooks are not setup. This means that a new segment_alloc will not be registered causing an expensive dynamic registration each time the segment is used. The fix is to guarantee that cuda is initialized before attaching the hooks. If cuda is already initialized, then this lazyInitCUDA is a no-op.
Test Plan:
Testing this on fsdp+tp example model where cuda is not initialized before init_process_group.
Job without the fix keeps dynamically registering:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-j544j0vn7zqh4c?job_attempt=0&version=0&env=PRODUCTION
The following keeps looping:
[0]:2024-02-14T10:48:18.873079 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: registered buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.873087 twshared0039:4836:6232 [0] NCCL INFO *dynamicRegist = true
[0]:2024-02-14T10:48:18.903234 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregister buffer 0x7f6ebe000000 len 608124000, state 1
[0]:2024-02-14T10:48:18.903240 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregiter buffer 0x7f6ebe000000 len 608124000
Job with the fix does not have this issue:
https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-hzm5dwqncr7l7?version=0&env=PRODUCTION
Reviewed By: minsii, kwen2501, xw285cornell
Differential Revision: D53770989
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120052
Approved by: https://github.com/kwen2501
# Motivation
According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of device `Allocator` dedicated for XPU to PyTorch. And following our design prepare to generalize `Allocator` in parallel.
# Design
In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below.
<p align="center">
<img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218">
</p>
# Additional Context
We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`.
Besides these PRs, we plan to generalize the device `Allocator` device-agnostic through another PR.
In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expendable segments and statistics. We will add these features back in the subsequent PR which intend to generalize `Allocator`.
The differences with CUDA:
only key functionality, and lack of AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segment...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD
ghstack dependencies: #117611, #117619, #117734
Summary:
1. Make sure folded constants generated internally doesn't get exposed.
2. Add runConstantFolding and related API calls
Test Plan:
```buck2 run mode/opt-split-dwarf -c fbcode.nvcc_arch=v100,a100 caffe2/caffe2/fb/predictor/tests_gpu:pytorch_predictor_container_gpu_test -- --gtest_filter=*PyTorchPredictorContainerTest.LoadAOTInductorModel*
```
The test triggers the added predictor tests `test_aot_inductor_merge_net_file_*.predictor_20240206`,
which would trigger runConstantFolding from predictor's module loading.
Reviewed By: SherlockNoMad
Differential Revision: D53718139
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119823
Approved by: https://github.com/chenyang78
# Motivation
As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event` which handles the status of an operation that is being executed. Typically, in some circumstances, we can fine-grain control of the operation execution via `Event`.
# Design
`XPUEvent` is a movable but not a copyable wrapper around sycl event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, `XPUEvent` can wait for another `XPUEvent` or all the submitted kernels on an `XPUStream` to complete. Align to the other backend, the C++ files related to `Event` will be placed in `aten/src/ATen/xpu` folder. For frontend code, `XPUEvent` runtime API will be bound to Python `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and Python code will be placed in `torch/xpu/streams.py` respectively.
# Additional Context
It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`. We will be adding support for it soon. Meanwhile `XPUEvent` doesn't support IPC from different processes. For the other parts, we have almost a 1:1 mapping with CUDA.
lack of the below APIs:
- `torch.cuda.Event.ipc_handle`
- `CUDAEvent`'s constructor with `IpcEventHandle`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #117611, #117619
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.
```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
virtual ~ViewFunc() {}
/// Returns any SymInts in the saved state.
virtual std::vector<c10::SymInt> get_symints() const { return {}; }
/// Returns the number of SymInts in the saved state.
virtual size_t num_symints() const { return 0; }
/// Returns any tensors in the saved state.
virtual std::vector<at::Tensor> get_tensors() const { return {}; }
/// Returns the number of tensors in the saved state.
virtual size_t num_tensors() const { return 0; }
/// Reapplies the view on the given base using the saved state.
virtual at::Tensor operator()(const at::Tensor&) const = 0;
/// Returns a clone of this ViewFunc, optionally with the specified saved state.
virtual std::unique_ptr<ViewFunc> clone_and_set(
std::optional<std::vector<c10::SymInt>> = c10::nullopt,
std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;
protected:
/// Sets the values of any SymInts in the saved state. The input vector size must
/// match the number of SymInts in the saved state (i.e. the size of the list
/// returned by get_symints()).
virtual void set_symints(std::vector<c10::SymInt>) {}
/// Sets the values of any Tensors in the saved state. The input vector size must
/// match the number of Tensors in the saved state (i.e. the size of the list
/// returned by get_tensors()).
virtual void set_tensors(std::vector<at::Tensor>) {}
};
```
New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`
The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.
Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
{};
virtual ~SliceTensorViewFunc() override {};
virtual std::vector<c10::SymInt> get_symints() const override;
virtual size_t num_symints() const override;
virtual std::vector<at::Tensor> get_tensors() const override;
virtual size_t num_tensors() const override;
virtual at::Tensor operator()(const at::Tensor&) const override;
virtual std::unique_ptr<ViewFunc> clone_and_set(
std::optional<std::vector<c10::SymInt>> = c10::nullopt,
std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;
protected:
virtual void set_symints(std::vector<c10::SymInt>) override;
virtual void set_tensors(std::vector<at::Tensor>) override;
private:
int64_t dim;
c10::optional<c10::SymInt> start;
c10::optional<c10::SymInt> end;
c10::SymInt step;
};
...
// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
::std::vector<c10::SymInt> symints;
symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
if(start.has_value()) symints.insert(symints.end(), *(start));
if(end.has_value()) symints.insert(symints.end(), *(end));
symints.push_back(step);
return symints;
}
size_t SliceTensorViewFunc::num_symints() const {
return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}
void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
auto i = 0;
if(start.has_value()) start = symints[i];
i += (start.has_value() ? 1 : 0);
if(end.has_value()) end = symints[i];
i += (end.has_value() ? 1 : 0);
step = symints[i];
}
std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
::std::vector<at::Tensor> tensors;
return tensors;
}
size_t SliceTensorViewFunc::num_tensors() const {
return static_cast<size_t>(0);
}
void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());
}
at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}
std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
std::optional<std::vector<c10::SymInt>> symints,
std::optional<std::vector<at::Tensor>> tensors) const {
auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
if (symints.has_value()) {
output->set_symints(std::move(*(symints)));
}
if (tensors.has_value()) {
output->set_tensors(std::move(*(tensors)));
}
return output;
}
```
The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.
For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
Summary:
Previously, we just store the char pointer in entry, the string is a
temp object and will be destructed when we want to dump/access it.
A quick fix is to store a copy of the string, but without changing the
upstream char*.
An alternative is to change every profilingTitle into std:string, this
however would needs comprehensive overhall of the code up to the
c10d::work layer above workNCCL and RecordFunction etc.
We chose the first option for this change
Resolve#119808
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837
Approved by: https://github.com/zdevito, https://github.com/wconstab
Summary:
auto& entry = entries_.at(*id % max_entries_);
entry = entries_.at(*id % max_entries_);
The above line of code has unintended consequence of invoking copy/assignment
of entry objects as ref itself cannot be re-assigned.
Also what could cause the crash is that the entry ref could become invalid if entries_ are
resized by other threads. and this could result in 'copy to a garbage
location'. The fix is to use a pointer which can be re-assigned after
re-acquiring the lock
Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748
Approved by: https://github.com/wconstab, https://github.com/fegin
Fixes#118990
The root cause is due to `out_features` of Linear not matching `num_features` of BatchNorm, resulting in shape mismatch while computing `fused_w`, and `fused_b`. This can happen for linear-bn folding because linear layer operates over the last dim, `(*, H_in)`, while bn layer operates over the channel dim, `(N, C_in, H, W)`.
To preserve the shapes of the original linear weight and bias in linear-bn folding, check linear `out_features` match bn `num_features`. If they don't match, bn `num_features` need to be 1 to broadcast.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119264
Approved by: https://github.com/eellison
Recently we made it possible to serialize ExportedPrograms with fake parameters/buffers/etc.
The serialization regime was kind of whacky; basically we serialized a stub and reassembled the FakeTensor using metadata that we had stashed elsewhere in the Graph state.
This was bad for a few reasons:
- Storing the metadata separately from the actual serialized object caused situations where you could have one but not the other. An example case is if you had a FakeTensor contained inside a TorchBind object—there was no obviously place to store the metadata for this. This actually happens—TensorQueue in fbgemm does this.
- It created an annoying cycle: we had to deserialize the Graph's tensor metadata in order to deserialize (potentially faked) constants, but we need constants in order to deserialize the Graph.
This fixes all that. The basic idea is to patch the reducer function for FakeTensor at serialization time, and serialize a copy of the FakeTensor metadata. We already are policing BC for the TensorMeta schema struct so it's not a net increase in the BC surface.
As a bonus, I fixed a weird bug with torchbind tracing where we were accidentally reinterpreting a torch.ScriptObject as a torch.ScriptModule (which was the root cause of some weird behavior @bahuang was seeing last week).
Differential Revision: [D53601251](https://our.internmc.facebook.com/intern/diff/D53601251/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119531
Approved by: https://github.com/zhxchen17
Summary:
This PR tries to resolve issue #119215.
Basically, processgroup shutdown (and hence ncclCommAbort) is called in
destroy_process_group APIs for the corresponding PGs. and in the
destructor of ProcessGroup, we avoid calling abort/ncclCommAbort.
Instead, it just checks if the users have explicitly already called destroy_process_group. If
not, Destructor will log a warning and encourage/expect users to do so
for cleanup of resources of PGs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250
Approved by: https://github.com/minsii, https://github.com/kwen2501
# Motivation
According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`.
# Design
Currently, it primarily offers stream-related APIs, including
- `torch.xpu.StreamContext`
- `torch.xpu.current_stream`
- `torch.xpu.set_stream`
- `torch.xpu.synchronize`
- `torch._C._xpu_getCurrentRawStream`
# Additional Context
We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR related with `Event`.
The differences with CUDA:
no default and external stream in XPU and lack of below APIs:
- `torch.cuda.ExternalStream`
- `torch.cuda.default_stream`
- `toch.cuda.is_current_stream_capturing`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #117611
Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on.
```cpp
/// Base class for view functions, providing reapplication of a view on a new base.
/// Each view op should get a codegenerated subclass of this class containing
/// any state needed to reconstruct the view. The class also provides convenience
/// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification,
/// where we want to use symbolic values or fake tensors instead.
struct TORCH_API ViewFunc {
virtual ~ViewFunc() {}
/// Returns any SymInts in the saved state.
virtual std::vector<c10::SymInt> get_symints() const { return {}; }
/// Returns the number of SymInts in the saved state.
virtual size_t num_symints() const { return 0; }
/// Returns any tensors in the saved state.
virtual std::vector<at::Tensor> get_tensors() const { return {}; }
/// Returns the number of tensors in the saved state.
virtual size_t num_tensors() const { return 0; }
/// Reapplies the view on the given base using the saved state.
virtual at::Tensor operator()(const at::Tensor&) const = 0;
/// Returns a clone of this ViewFunc, optionally with the specified saved state.
virtual std::unique_ptr<ViewFunc> clone_and_set(
std::optional<std::vector<c10::SymInt>> = c10::nullopt,
std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0;
protected:
/// Sets the values of any SymInts in the saved state. The input vector size must
/// match the number of SymInts in the saved state (i.e. the size of the list
/// returned by get_symints()).
virtual void set_symints(std::vector<c10::SymInt>) {}
/// Sets the values of any Tensors in the saved state. The input vector size must
/// match the number of Tensors in the saved state (i.e. the size of the list
/// returned by get_tensors()).
virtual void set_tensors(std::vector<at::Tensor>) {}
};
```
New codegen files:
* `torch/csrc/autograd/generated/ViewFunc.h`
* `torch/csrc/autograd/generated/ViewFuncs.cpp`
The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd.
Example codegen for `slice.Tensor`:
```cpp
// torch/csrc/autograd/generated/ViewFuncs.h
#define SLICE_TENSOR_VIEW_FUNC_AVAILABLE
struct SliceTensorViewFunc : public torch::autograd::ViewFunc {
SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step)
{};
virtual ~SliceTensorViewFunc() override {};
virtual std::vector<c10::SymInt> get_symints() const override;
virtual size_t num_symints() const override;
virtual std::vector<at::Tensor> get_tensors() const override;
virtual size_t num_tensors() const override;
virtual at::Tensor operator()(const at::Tensor&) const override;
virtual std::unique_ptr<ViewFunc> clone_and_set(
std::optional<std::vector<c10::SymInt>> = c10::nullopt,
std::optional<std::vector<at::Tensor>> = c10::nullopt) const override;
protected:
virtual void set_symints(std::vector<c10::SymInt>) override;
virtual void set_tensors(std::vector<at::Tensor>) override;
private:
int64_t dim;
c10::optional<c10::SymInt> start;
c10::optional<c10::SymInt> end;
c10::SymInt step;
};
...
// torch/csrc/autograd/generated/ViewFuncs.cpp
std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const {
::std::vector<c10::SymInt> symints;
symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
if(start.has_value()) symints.insert(symints.end(), *(start));
if(end.has_value()) symints.insert(symints.end(), *(end));
symints.push_back(step);
return symints;
}
size_t SliceTensorViewFunc::num_symints() const {
return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1);
}
void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) {
TORCH_INTERNAL_ASSERT(symints.size() == num_symints());
auto i = 0;
if(start.has_value()) start = symints[i];
i += (start.has_value() ? 1 : 0);
if(end.has_value()) end = symints[i];
i += (end.has_value() ? 1 : 0);
step = symints[i];
}
std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const {
::std::vector<at::Tensor> tensors;
return tensors;
}
size_t SliceTensorViewFunc::num_tensors() const {
return static_cast<size_t>(0);
}
void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) {
TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors());
}
at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const {
return at::_ops::slice_Tensor::call(input_base, dim, start, end, step);
}
std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set(
std::optional<std::vector<c10::SymInt>> symints,
std::optional<std::vector<at::Tensor>> tensors) const {
auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step);
if (symints.has_value()) {
output->set_symints(std::move(*(symints)));
}
if (tensors.has_value()) {
output->set_tensors(std::move(*(tensors)));
}
return output;
}
```
The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification.
For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly.
```sh
python test/test_autograd.py -k test_view_func_replay
python test/test_ops.py -k test_view_replay
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404
Approved by: https://github.com/ezyang
In some cases where we have TORCH_CHECK in loops, it may cause the host
compiler to spend hours optimizing the run_impl function. This PR
mitigated the issue by replacing TORCH_CHECK with a custom AOTI_CHECK,
where we force the underneath assert function to be noinline.
If forcing noinline caused any serious perf regression, we could
either add an option to turn on/off enable noinline. Or, we could
another an option to just turn AOTI_CHECK into a no-op, similar
to the ```assert``` macro from cassert.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119220
Approved by: https://github.com/hl475, https://github.com/desertfire
Summary:
With the compiled PyTorch module, in execution_trace_observer.cpp, function convertIValue calls TensorImpl->storage_offset(). That function call will trigger a recursive call into recordOperatorStart. It will cause a deadlock on ob.g_mutex.
This DIFF is to fix this deadlock by replacing std::mutex with std::recursive_mutex.
Since PyTorch only has one thread for FWD, and one thread for BWD. The contention is very low, the performance should NOT be a concern.
Test Plan:
Unit Test
buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2
Differential Revision: D53533253
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119398
Approved by: https://github.com/aaronenyeshi
Some APIs like ncclCommAbort can cause nccl kernels to finish even if
they were previously stuck. Because we can gather the trace buffer after
those calls, we can end up seeing some collectives marked completed eventhough
that complete happened several minutes after they started and clearly after
the timeout. This changes how we record state so that we keep track of the time
we discover a state change, so even if eventually the collective gets marked complete,
we can observe it happened minutes after it was schedule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119249
Approved by: https://github.com/wconstab
Everything inside the `AT_DISPATCH` block is being compiled 5 times,
so it makes sense to limit it to the only line that uses `scalar_t` which is
the `numeric_limits` query.
Also a small optimization, instead of computing `grad.log()` and `(-grad).log()`
we can compute `grad.abs().log()` which is 2 pointwise ops instead of 3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119397
Approved by: https://github.com/lezcano, https://github.com/albanD
**Summary**
The reducer of `DistributedDataParallel` is implemented with C++ and it is not easy to trace the allreduce launched in the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. The changes allow us to use `compiled_autograd` to trace the allreduce and later be optimized (fused) in the Inductor.
**Key Logic**
1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters.
2. In the first forward() call, if `DistributedDataParallel` is not compiled, all `compiled_accum_grad_hook` are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` will be compiled by `compiled_autograd`.
3. `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter.
**Bucketing**
The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces.
The bucketing is done in a separate PR.
Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662
Approved by: https://github.com/wconstab
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR covers the changes under lazy initialization.
# Design
This PR primarily offers the support of multi-processing via lazy initialization. We lazily initialize our runtime avoiding initializing XPU until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability.
# Additional Context
We adopt a similar design to CUDA. So we share some code with CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
ghstack dependencies: #119248
Attempt #2 for https://github.com/pytorch/pytorch/pull/117875 to fix https://github.com/pytorch/pytorch/issues/112090.
Summary of changes:
- ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor)
- Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated.
- ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor)
- CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free.
- code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references.
- Added tests that check for memory leaks and cache deletion operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119107
Approved by: https://github.com/jansel
Summary:
With the compiled PyTorch module, in execution_trace_observer.cpp, function convertIValue calls TensorImpl->storage_offset(). That function call will trigger a recursive call into recordOperatorStart. It will cause a deadlock on ob.g_mutex.
This DIFF is to fix this deadlock by replacing std::mutex with std::recursive_mutex.
Since PyTorch only has one thread for FWD, and one thread for BWD. The contention is very low, the performance should NOT be a concern.
Test Plan:
Unit Test
buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2
Differential Revision: D53299183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119242
Approved by: https://github.com/aaronenyeshi
Fixes https://github.com/pytorch/pytorch/issues/117361
The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one.
This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds.
The infra pieces of this PR are:
* Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv
* When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`.
* Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way.
The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises.
As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.)
When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard on unbacked SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them where ever you need some handling for special1! As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579
Approved by: https://github.com/eellison, https://github.com/lezcano
### Motivation
Despite our plan to reduce gloo usage, it is still being widely used as testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real world scenario. There's some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues but I'd rather just fix the coverage.
### This PR
We already have a fallback impl for `_reduce_scatter_base`, which is composed from all-reduce + scatter. The scatter was not necessary. It introduces extra communication, sync point, and forced the impl to fail on `asyncOp=True`. This PR does the following:
- Simulate reduce-scatter with `allreduce(inp).chunk(world_size)[rank]`. This is still 2x communication than a real reduce-scatter (since all-reduce = reduce-scatter + all-gather), but it's strictly better than what we have now.
- By doing the above, the comm becomes async and we don't have to fail on `asyncOp=True`.
- The general logic is implemented in `reduce_scatter_tensor_coalesced`. `_reduce_scatter_base` just calls it with single input/output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118911
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #118910
Fixes#112389 and https://github.com/facebookincubator/dynolog/issues/208
This PR enables profiler initialization for CPU only use cases. The main goal is to enable on-demand profiling with a daemon when using CPU only mode of PyTorch.
* When CUDA is available the profiler is initialized on first CUDA stream creation (or lazily when profiler is run).
* Since the CUDA stream creation callback does not exist on CPU only PyTorch the profiler is never initied on its own.
* Thus the job does not register with Dynolog when we set "KINETO_USE_DAEMON" env variable to set.
Part of the fix is in Kineto https://github.com/pytorch/kineto/pull/861, we point to it in PyTorch.
The change in PyTorch is to correctly set the `cpuOnly` argument.
## TestPlan:
Build PyTorch from source with USE_CUDA=0 so we have CPU only based build. Git hash = `a40951defd87b9a5e582cf9112bf7a8bd0930c79`
(See instructions in PyTorch repo)
For the setup we run dynolog daemon in another terminal
```
buck2 run dynolog/src:dynolog -- --enable_ipc_monitor &
```
Now run an example model in PyTorch - see [linear_model.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py) , and set the device to 'cpu' inside the code instead of 'cuda'.
```
export KINETO_USE_DAEMON=1
python linear_model_example.py
```
Output shows the profiler registration with dynolog
```
(pytorch) [bcoutinho@devgpu038.ftw6 ~/local/pytorch (main)]$ python linear_model_example.py
INFO:2024-01-25 11:08:53 1807792:1807792 init.cpp:122] Registering daemon config loader, cpuOnly = 1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0dc36b8a-e14c-4260-958b-4b2e7d15e986 status = initialized
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1
```
We can also collect a trace using
```
[bcoutinho@devgpu038.ftw6 ~/fbsource/fbcode (3bc85f968)]$ buck2 run dynolog/cli:dyno -- gputrace --log-file /tmp/test.json
Kineto config =
ACTIVITIES_LOG_FILE=/tmp/test.json
PROFILE_START_TIME=0
ACTIVITIES_DURATION_MSECS=500
PROFILE_REPORT_INPUT_SHAPES=false
PROFILE_PROFILE_MEMORY=false
PROFILE_WITH_STACK=false
PROFILE_WITH_FLOPS=false
PROFILE_WITH_MODULES=false
response length = 147
response = {"activityProfilersBusy":0,"activityProfilersTriggered":[1807792],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[1807792]}
Matched 1 processes
Trace output files will be written to:
/tmp/test_1807792.json
```
And trace file contains the trace correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118320
Approved by: https://github.com/aaronenyeshi
### Motivation
Despite our plan to reduce gloo usage, it is still being widely used as testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real world scenario. There's some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)) and [this](b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272))). For native funcol I ran into the same issues but I'd rather just fix the coverage.
**I think it's reasonable to think of this as a fix rather than adding new features. This is orthogonal to the potential reduction of gloo usage**.
### This PR
This PR adds `ProcessGroupGloo::allgather_into_tensor_coalesced`. This is very straightforward - `ProcessGroupGloo` already supports `allgather_coalesced`, to which we can funnel `allgather_into_tensor_coalesced`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118910
Approved by: https://github.com/shuqiangzhang
# Motivation
This PR intends to extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability.
# Design
We maintain a flag for each backend to manage the lazy initialization state separately.
# Additional Context
No need more UTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846
Approved by: https://github.com/malfet
# Motivation
According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`.
# Design
This PR primarily offers device-related APIs in python frontend, including
- `torch.xpu.is_available`
- `torch.xpu.device_count`
- `torch.xpu.current_device`
- `torch.xpu.set_device`
- `torch.xpu.device`
- `torch.xpu.device_of`
- `torch.xpu.get_device_name`
- `torch.xpu.get_device_capability`
- `torch.xpu.get_device_properties`
- ====================
- `torch.xpu._DeviceGuard`
- `torch.xpu._is_compiled`
- `torch.xpu._get_device`
# Additional Context
We will implement the support of lazy initialization in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850
Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet
Summary:
There is an annoying inconsistency in how we pickle custom objs.
`torch.save` will invoke regular pickle, for which we have bound `__setstate__`/`__getstate__` methods on `torch.ScriptObject`: https://fburl.com/code/4howyl4u.
This serializes in a different format than TorchScript does, which uses the TS C++ pickler.
The issue we were facing was using the Python pickler to save, and the C++ pickler to load. If we use the C++ pickler to both save and load (plus some plumbing to get type/object resolution to work correctly), then things should work.
Test Plan:
ran SherlockNoMad's repro
```
buck2 run 'fbcode//mode/dev-nosan' scripts/bahuang:export_torchbind -- --logging DBG
```
Got to a new error, which has to do with how we're initializing the graph, but will leave that for future diffs.
Reviewed By: SherlockNoMad
Differential Revision: D53248454
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118791
Approved by: https://github.com/qxy11, https://github.com/SherlockNoMad, https://github.com/khabinov
Summary:
Add Runtime Constant-folding for AOTInductor.
This also include the invocation of constant folding at load time.
The constant folding lowering is a 2-step process.
First, we split the graph into 2 modules, one of it is the constant module, which doesn't depend on any input and the whole module could be inferred (constant-folded) one-time and be reused. The constant module, is lowered, and being codegen-ed as usual and cached (let's call this constant code). The constant code reuses the whole lowering/profiling/etc. process, only difference is that we do not generate any headers or initialization for the constant code.
Second, after handling the constant module, we take care of the main module (which is the part that would depend on the user input.) For the main module, we take in one additional component, the constant code, compare with a normal lowering. Addition step we do here is that, we inject the constant code into the codegen-ed main module, and create the caller for the main module to consume the result of the constant module.
Test Plan: Unit tests included in commit.
Differential Revision: D53274382
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765
Approved by: https://github.com/chenyang78
Summary:
The TorchScript interpreter had multiple opcodes whose logic had the potential to access the registers_ array out of bounds.
This change ensures that all registers_ accesses are in bounds or an exception will be thrown.
Test Plan: contbuild + OSS signals
Differential Revision: D49748737
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110300
Approved by: https://github.com/malfet, https://github.com/kimishpatel
Summary:
During debugging of some timeouted jobs, I found it difficult to
identify which rank is at fault eventhough we have logs of many ranks
reporting timeout on a specific collective seq.
If we can also report lastEqueuedSeq and lastCompletedSeq, it would be
much easier to identify,
1. whether a rank has not even join a collective call (not enqueued)
2. Or it has joined the collective call, but not completed.
For the 1st case, it is mostly likely users code problem
for the 2ed case, it could be lower-layer issues
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118582
Approved by: https://github.com/wconstab
### Summary
Native functional collective ops requires the backend to be implemented in C++. Porting `FakeProcessGroup` to cpp so that it can also work for native functional collective ops.
The existing tests involving `FakeProcessGroup` all pass. In addition, `DeviceMeshTest::test_fake_pg_device_mesh` now pass with `_USE_NATIVE_C10D_FUNCTIONAL=1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118426
Approved by: https://github.com/wanchaol
ghstack dependencies: #113057
resolve#117749
Summary:
Updated the PR with the following intentions:
1. identify eagerMode init (as opposed to lazy init), in which case we will create NCCL comms without guarantees that they are fully initialized if NONBLOCKING mode is also enabled.
2. Python users can do their other works (e.g., model init) between invoking init_process_group and their first collective call.
3. c10D would guarantee/wait for communicators to be initialized before issuing the first collective call.
4. For NCCL collective calls, the contract between python users and c10d is not changed much from blocking calls (C10d would wait the NCCL call to be ncclSuccess, or timeout, whichever happens first).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118256
Approved by: https://github.com/kwen2501
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.
NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
resolve https://github.com/pytorch/pytorch/issues/117749
Summary:
This is the first step to enable NCCL nonblocking mode.
In NCCL nonblocking mode, ncclInProgress is an expected return value
when checking communicators. Without this relaxation, watchdog thread
would throw NCCL errors during work checking while it is expected.
Test Plan:
Set nonblocking mode in unit tests, and make sure all existing NCCL
tests pass
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118254
Approved by: https://github.com/kwen2501
This is a lot of files changed! Don't panic! Here's how it works:
* Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file.
* When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded.
* The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors.
* Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list.
* Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves.
* torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state.
* There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many.
In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file.
The codemod was done with this script authored by GPT-4:
```
import glob
exclude_patterns = [
...
]
for pattern in exclude_patterns:
for filepath in glob.glob(pattern, recursive=True):
if filepath.endswith('.py'):
with open(filepath, 'r+') as f:
content = f.read()
f.seek(0, 0)
f.write('# mypy: ignore-errors\n\n' + content)
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414
Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD
Relying on object lifetimes in Python is a bad idea due to reference
cycles. Previously, when a torch.library.Library object gets destroyed,
it clears all the registrations associated with it, but it's unclear
when it actually gets destroyed due to the existence of refcycles.
This PR:
- adds torch::Library::clear(), which deterministically releases all of
the RAII registration handles of the torch::Library object
- adds a new `torch.library._scoped_library` context manager, which creates
a library and cleans it up at the end of the scope using the previous item.
All tests (unless they already handle library lifetimes) should use
this new API
- Rewrites some flaky tests to use `_scoped_library`.
In the future we'll probably migrate all of our torch.library tests to
use `_scoped_library`, but that's kind of annoying because we have
multiple thousands of LOC
I'm hoping this will deflake those tests; we'll see.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118318
Approved by: https://github.com/albanD
This PR relands #108238 that was closed as stale due to CLA issues and also because the CI check has marked the PR as not mergeable.
Repro 1:
```python
import torch
def f(x):
return x[x > 0]
jf = torch.jit.trace(f, torch.tensor(2., device="cuda"))
```
Error:
```bash
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/pytorch/torch/jit/_trace.py", line 874, in trace
traced = torch._C._create_function_from_trace(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<stdin>", line 2, in f
RuntimeError: NYI: Named tensors are not supported with the tracer
```
Repro2:
```python
import torch
import torch.nn.functional as F
from torch import nn
import copy
class Net(nn.Module):
def __init__(self):
super().__init__()
def forward(self, inputs):
x = copy.deepcopy(inputs) # RuntimeError: NYI: Named tensors are not supported with the tracer
x = F.relu(x)
return x
model = Net()
images = torch.randn(8, 28, 28)
torch.jit.trace(model, images)
```
Error 2:
```bash
Traceback (most recent call last):
File "/opt/pytorch/test_deepcopy.py", line 18, in <module>
File "/opt/pytorch/torch/jit/_trace.py", line 806, in trace
return trace_module(
^^^^^^^^^^^^^
File "/opt/pytorch/torch/jit/_trace.py", line 1074, in trace_module
module._c._create_method_from_trace(
File "/opt/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/torch/nn/modules/module.py", line 1501, in _slow_forward
result = self.forward(*input, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/test_deepcopy.py", line 12, in forward
x = F.relu(x)
^^^^^^^^^^
File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
y = copier(memo)
^^^^^^^^^^^^
File "/opt/pytorch/torch/_tensor.py", line 122, in __deepcopy__
new_storage = self._typed_storage()._deepcopy(memo)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/pytorch/torch/storage.py", line 847, in _deepcopy
return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy
y = copier(memo)
^^^^^^^^^^^^
File "/opt/pytorch/torch/storage.py", line 112, in __deepcopy__
new_storage = self.clone()
^^^^^^^^^^^^
File "/opt/pytorch/torch/storage.py", line 126, in clone
return type(self)(self.nbytes(), device=self.device).copy_(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: NYI: Named tensors are not supported with the tracer
```
----
#48054 RuntimeError: NYI: Named tensors are not supported with the tracer
#49538 jit tracer doesn't work with unflatten layer
#31591 when i try to export a pytorch model to ONNX, got RuntimeError: output of traced region did not have observable data dependence with trace inputs; this probably indicates your program cannot be understood by the tracer.
- This bug was closed but exists. Multiple comments on it still showing error. This is addressed
Likely fixes the following issues (but untested)
#63297 Named tensor in tracer
#2323 [Bug] torch.onnx.errors.UnsupportedOperatorError when convert mask2former to onnx
Fix zero dimensioned tensors when used with jit.trace They are currently assigned an empty set for names {} this is not the same as "no name" so jit.trace bails with
"NYI: Named tensors are not supported with the tracer"
This happens when I am trying to save a non-trivial model as onnx but the simplest repro I have seen is 48054 above which has been added as test/jit/test_zero_dim_tensor_trace.py
Test plan:
New unit test added
Broken scenarios tested locally
CI
Fixes#48054
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118393
Approved by: https://github.com/zou3519
# Summary
Simplification of Backend Selection
This PR deprecates the `torch.backends/cuda/sdp_kernel` context manager and replaces it with a new context manager `torch.nn.attention.sdpa_kernel`. This context manager also changes the api for this context manager.
For `sdp_kernel` one would specify the backend choice by taking the negation of what kernel they would like to run. The purpose of this backend manager was to only to be a debugging tool, "turn off the math backend" and see if you can run one of the fused implementations.
Problems:
- This pattern makes sense if majority of users don't care to know anything about the backends that can be run. However, if users are seeking to use this context manager then they are explicitly trying to run a specific backend.
- This is not scalable. We are working on adding the cudnn backend and this API makes it so so that more implementations will need to be turned off if user wants to explicitly run a given backend.
- Discoverability of the current context manager. It is somewhat un-intutive that this backend manager is in backends/cuda/init when this now also controls the CPU fused kernel behavior. I think centralizing to attention namespace will be helpful.
Other concerns:
- Typically backends (kernels) for operators are entirely hidden from users and implementation details of the framework. We have exposed this to users already, albeit not by default and with beta warnings. Does making backends choices even more explicit lead to problems when we potentially want to remove existing backends, (perhaps inputs shapes will get covered by newer backends).
A nice side effect is now that we aren't using the `BACKEND_MAP` in test_transformers many, many dynamo failures are passing for CPU tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114689
Approved by: https://github.com/cpuhrsch
This PR intends to fix the following issue when swapping two tensors
```python
>>> import torch
>>> torch.manual_seed(5)
>>> t1 = torch.randn(2)
>>> t2 = torch.randn(3)
>>> t1
tensor([-0.4868, -0.6038])
>>> t2
tensor([-0.5581, 0.6675, -0.1974])
>>> torch.utils.swap_tensors(t1, t2)
>>> t1
tensor([-0.5581, 0.6675, -0.1974])
>>> t2
tensor([-0.4868, -0.6038])
>>> t1.fill_(0.5) # t1 back to its unswapped state :o
tensor([-0.4868, -0.6038])
```
What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned.
57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)
When we run any operation that returns the same TensorImpl (e.g. inplace op, `t.to(dtype=t.dtype)`, etc.), although `t1` now has `t2`'s TensorImpl, `t2`'s TensorImpl still has a reference to `t2`, so when we do the op on `t1` and `THPVariable_Wrap` attempts to return the pointer to the TensorImpl's PyObject, we return a pointer to `t2` instead.
The TensorImpl should have the PyObjects in their PyObjectSlots swapped as well in `swap_tensors`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955
Approved by: https://github.com/albanD
At lest if one tries to compile the AOTI code on Darwin, compilation
fails with implicit instantiation of undefined template error:
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3:
/Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:69:21: error: implicit instantiation of undefined template 'std::basic_stringstream<char>'
std::stringstream ss;
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118075
Approved by: https://github.com/desertfire
ghstack dependencies: #118074
Summary:
Previously, heatbeat was incremented once per finishing a for loop over a list
of in-progress work items, under the assumption that either the processing
would be predictably quick, or it would hang completely.
In fact, there can be cuda API contention that causes the processing of works
to slow down arbitrarily but not truly deadlock. To guard against this, we
bump the heartbeat at the smallest unit of progress, one work item being
successfully processed.
Test Plan: CI
Differential Revision: D52973948
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118016
Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501
Summary:
`conv1d` has two arguments `weight` and `bias` which are stored as constant tensors on the CPU and they are transferred to GPU at every inference call. We create a context for this operator to avoid the repeated passing. Specifically, we
- created `Conv1dPackedContext`,`create_conv1d_context` and `run_layernorm_context` in `Convolution.h` and `Convolution.cpp`
- registered them in `Register.cpp`
- rewrote the graph representation of the op in `vulkan_rewrite.cpp`
Test Plan:
## Numerical test
```
[luwei@82308.od /data/sandcastle/boxes/fbsource (8a8d911dc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="*conv1d*"
Buck UI: https://www.internalfb.com/buck2/7760800b-fd75-479a-9368-be5fcd5a7fef
Network: Up: 0B Down: 0B
Jobs completed: 4. Time elapsed: 0.6s.
BUILD SUCCEEDED
Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = *conv1d*
[==========] Running 2 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 2 tests from VulkanAPITest
[ RUN ] VulkanAPITest.conv1d_simple
[ OK ] VulkanAPITest.conv1d_simple (159 ms)
[ RUN ] VulkanAPITest.conv1d
[ OK ] VulkanAPITest.conv1d (57 ms)
[----------] 2 tests from VulkanAPITest (217 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test suite ran. (217 ms total)
[ PASSED ] 2 tests.
```
Full test result in P1053644934, summary as below
```
[----------] 419 tests from VulkanAPITest (28080 ms total)
[----------] Global test environment tear-down
[==========] 419 tests from 1 test suite ran. (28080 ms total)
[ PASSED ] 418 tests.
[ SKIPPED ] 1 test, listed below:
[ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log
```
## Graph representation comparison
We created a model using `conv1d` and traced it as below
```
# Define a simple model that uses conv1d
class MyModel(torch.nn.Module):
def __init__(self):
super(MyModel, self).__init__()
self.conv1d = nn.Conv1d(16, 33, 3)
def forward(self, x):
return self.conv1d(x)
# Create an instance of the model
model = MyModel()
# Create a dummy input tensor for tracing
input_tensor = torch.randn(20, 16, 50)
# Use torch.jit.trace to trace the model and generate a graph
traced_model = torch.jit.trace(model, input_tensor)
```
Then we converted the traced model to Vulkan backend using `optimize_for_mobile`
```
from torch.utils import mobile_optimizer
vulkan_model = mobile_optimizer.optimize_for_mobile(
traced_model, backend="vulkan", preserved_methods=to_preserve
)
```
Next we can print the graph of the `vulkan_model` as `print(vk_model.graph)`
- before this diff: `conv1d` was used
```
graph(%self.1 : __torch__.___torch_mangle_16.MyModel,
%x : Tensor):
%60 : Device = prim::Constant[value="cpu"]()
%self.conv1d.bias : Float(33, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
%37 : bool = prim::Constant[value=0]()
%36 : NoneType = prim::Constant()
%59 : Device = prim::Constant[value="vulkan"]()
%self.conv1d.weight : Float(33, 16, 3, strides=[48, 3, 1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]()
%7 : int = prim::Constant[value=1](), scope: __module.conv1d # /mnt/xarfuse/uid-23453/243f3953-seed-nspid4026532834_cgpid7972545-ns-4026532831/torch/nn/modules/conv.py:306:0
%18 : int[] = prim::Constant[value=[1]]()
%19 : int[] = prim::Constant[value=[0]]()
%39 : Tensor = aten::to(%x, %59, %36, %37, %37)
%20 : Tensor = aten::conv1d(%39, %self.conv1d.weight, %self.conv1d.bias, %18, %19, %18, %7)
%58 : Tensor = aten::to(%20, %60, %36, %37, %37)
return (%58)
```
- after this diff: `conv1d` was replaced with `run_conv1d_context`
```
graph(%self.1 : __torch__.___torch_mangle_6.MyModel,
%x : Tensor):
%85 : Device = prim::Constant[value="cpu"]()
%51 : bool = prim::Constant[value=0]()
%50 : NoneType = prim::Constant()
%84 : Device = prim::Constant[value="vulkan"]()
%53 : Tensor = aten::to(%x, %84, %50, %51, %51)
%prepack_folding_forward._jit_pass_packed_weight_0 : __torch__.torch.classes.vulkan.Conv1dPackedContext = prim::GetAttr[name="prepack_folding_forward._jit_pass_packed_weight_0"](%self.1)
%22 : Tensor = vulkan_prepack::run_conv1d_context(%53, %prepack_folding_forward._jit_pass_packed_weight_0)
%83 : Tensor = aten::to(%22, %85, %50, %51, %51)
return (%83)
```
Differential Revision: D52865379
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117780
Approved by: https://github.com/yipjustin
Because ANDROID>=21 is assumed in CI tests, it is time to remove old workarounds. math_compat.h contains solely wrapper math functions for ANDROID, so we can remove its usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116167
Approved by: https://github.com/ezyang
Today watchdog's sleep interval is 1s. That's a bit long compared to modern GPU link's (or network link's) speed.
Take DDP and Ampere for example:
DDP's bucket size = 25 MB
Ampere's NVLink speed = 250 GB/s
25 MB / 250 GB/s = 100 ms.
So we are updating the interval to 100 ms.
Update:
25 MB / 250 GB/s = 0.1 ms
But let's see how it goes so far between making the checking more aggressive.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117297
Approved by: https://github.com/fduwjj