pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Animesh Jain	79a04f2df9	[dynamo][guards-cpp-refactor] Permit dict version guard in DictGuardManager (#121327 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121327 Approved by: https://github.com/jansel	2024-03-08 01:24:00 +00:00
Adnan Akhundov	3d089de851	Add torch.cond support to AOT Inductor (#121120 ) Summary: In this PR, `torch.cond` support and the necessary codegening infrastructure is added to C++ wrapper (AOTInductor and friends). Notable additions: - A new mechanism in the Python wrapper codegen to precompile and save the Triton kernels (generated and user-defined) which haven't been covered by the active path through the control flow given the sample inputs. As we can't do the runtime autotuning of the kernels outside the active path, we precompile and save them with the `launchers[0]` (corresponding to the first config). - Codegen infra for `torch.cond` in the C++ wrapper (ABI- and non-ABI-compatible). The `torch.cond` codegen has been slightly refactored to avoid duplication across the Python and C++ wrappers. - More extensions of the caching sites in the wrapper code to cache per codegened graph (e.g., `codegen_int_array_var`) + some infra for tracking the current codegened graph in the wrapper (both during codegen-ing in the `Scheduler.codegen` and in the `WrapperCodeGen.generate` functions). - New unit tests to cover the added AOT Inductor + `torch.cond` functionality. 
Codegen examples from the new unit tests: - [`test_cond_simple_abi_compatible_cpu`](https://gist.github.com/aakhundov/862d5de9aa460f5df399e1387f7b342e) - [`test_cond_simple_abi_compatible_cuda`](https://gist.github.com/aakhundov/d70b81f95fa8cc768cedef9acacb25bb) - [`test_cond_simple_non_abi_compatible_cpu`](https://gist.github.com/aakhundov/c0ae7a8cbb6fa311c838e1b580f9a3f6) - [`test_cond_simple_non_abi_compatible_cuda`](https://gist.github.com/aakhundov/08b945d4e8a32c97b7f9ff6272f4a223) - [`test_cond_nested_abi_compatible_cuda`](https://gist.github.com/aakhundov/ce664f433c53e010ce4c0d96a6c13711) - [`test_cond_with_parameters_abi_compatible_cuda`](https://gist.github.com/aakhundov/77afbeb8eaab5c5b930a3f922a7baf12) - [`test_cond_with_multiple_outputs_abi_compatible_cuda`](https://gist.github.com/aakhundov/8cc06105ec8a3fe88be09b3f6e32c690) Test Plan: ``` $ python test/inductor/test_aot_inductor.py -k test_cond ... ---------------------------------------------------------------------- Ran 42 tests in 170.619s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121120 Approved by: https://github.com/jansel, https://github.com/chenyang78	2024-03-07 22:39:57 +00:00
Scott Wolchok	4c58f2b675	[PyTorch] Use uint32_t for ProcessedNode::num_outputs (#121335 ) We already use uint32_t for indexing, and the notion of a single graph node with more than four billion outputs stretches credulity. Differential Revision: [D54598821](https://our.internmc.facebook.com/intern/diff/D54598821/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121335 Approved by: https://github.com/Skylion007	2024-03-07 21:15:05 +00:00
PyTorch MergeBot	2b1661c7a0	Revert "[compiled autograd] support custom ops backed by c++ autograd::Function (#120681 )" This reverts commit `05c256849b`. Reverted https://github.com/pytorch/pytorch/pull/120681 on behalf of https://github.com/izaitsevfb due to breaking internal builds, see D54617701 ([comment](https://github.com/pytorch/pytorch/pull/120681#issuecomment-1984214079))	2024-03-07 18:53:51 +00:00
Shengbao Zheng	60aaba4128	create function to get ProcessGroupNCCL uid (#121132 ) Summary: expose ProcessGroupNCCL uid Differential Revision: D54446056 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132 Approved by: https://github.com/aaronenyeshi	2024-03-07 18:34:38 +00:00
Bin Bao	7e598c0053	[Inductor] Enable ABI-compatible mode for cpp-wrapper JIT (#121309 ) Differential Revision: [D54617284](https://our.internmc.facebook.com/intern/diff/D54617284) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121309 Approved by: https://github.com/chenyang78	2024-03-07 14:22:06 +00:00
cyy	4305c64fea	Change ATEN generator argument type to const std::optional<Generator>& (#120076 ) This PR proposes to use std::optional<Generator>& for underlying functions to avoid unnecessary copy and move operations. The torchgen code was changed to generate the new type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120076 Approved by: https://github.com/malfet	2024-03-07 09:52:21 +00:00
Chen_Liqing	291ce86a6c	Modify StorageImplCreateHelper (#118459 ) I want to use tensor.untyped_storage()[a:b] with the ``PrivateUse1`` backend, but it fails. The code goes into ``THPStorage_get``: `bb6eba189f/torch/csrc/Storage.cpp (L525-L540)` Here ``torch`` creates a new ``c10::StorageImpl`` without taking the ``PrivateUse1`` backend into account. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118459 Approved by: https://github.com/albanD	2024-03-07 06:26:55 +00:00
briancoutinho	b9087f8571	[profiler] Add execution_trace_observer as an optional argument to profiler (#119912 ) # Update Profiler API to collect Execution Traces ## TLDR We would like to simplify collecting Execution Trace and Kineto together. Execution Trace and Kineto both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware. ``` import torch def main(): with torch.profiler.profile( activities=[ torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA, ], … execution_trace_observer=ExecutionTraceObserver() # <<<<<<< NEW ) as prof: ... prof.step() ``` See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API. ## What are Execution Traces? [Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph-based representation of AI/ML workloads. It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies. - Chakra is part of ML Commons industry standard and is being adopted by other companies besides NVIDIA too. - At Meta we have instrumented PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki) Chakra essentially enables benchmarking and co-design for ML Models without having to reproduce entire software stacks and helps companies collaborate together [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)] ## Why correlate Execution Trace with PyTorch/Kineto Trace Execution Traces and Kineto traces provide different types of information, and combining them is useful. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. 
In addition, collecting them at different timestamps will be inaccurate as several operations (NCCL, Embedding lookup) are data dependent and may not match correctly. Thus, it makes sense to collect both ET and Kineto together. The problem is that there are two code paths. ## Proposal The proposal is to modify the PyTorch profiler (Kineto) API to enable execution trace to be collected simultaneously, see TLDR section # Testing Updated the unit test for collecting kineto and Execution Trace together. - Check the collected ET has right range of events. - Compare two sets of IDs - record func Ids in ET and external IDs in Kineto. We check if these have a constant difference. ``` pytest test/profiler/test_profiler.py -k test_execution_trace_with_kineto -rP Running 1 items in this shard test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up [W execution_trace_observer.cpp:694] Disabling Execution Trace Observer STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912 Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi	2024-03-07 01:30:26 +00:00
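The ID comparison described in the testing notes above can be sketched in plain Python. This is only an illustration of the "constant difference" check, not the code from `test/profiler/test_profiler.py`; the function name `has_constant_offset` and the sample IDs are hypothetical.

```python
def has_constant_offset(et_ids, kineto_ids):
    """Return True if the two ID sets differ by one constant offset.

    Both inputs are iterables of integers. The IDs are sorted first so the
    comparison does not depend on the order in which events were collected.
    """
    et_sorted = sorted(et_ids)
    kineto_sorted = sorted(kineto_ids)
    if not et_sorted or len(et_sorted) != len(kineto_sorted):
        return False
    # Collect every pairwise difference; a constant offset yields exactly one value.
    diffs = {k - e for e, k in zip(et_sorted, kineto_sorted)}
    return len(diffs) == 1


# Hypothetical IDs: record-function IDs from the ET and external IDs from Kineto.
sample_et_ids = [10, 11, 13, 17]
sample_kineto_ids = [117, 110, 113, 111]
offset_is_constant = has_constant_offset(sample_et_ids, sample_kineto_ids)
```

Sorting before pairing is a design choice of this sketch; it assumes the two traces contain the same events, which is the premise of collecting them together.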
Denis Yaroshevskiy	b0e2ed4d67	removing some macros (#120314 ) Summary: Will be making some changes in the surrounding code, they are going to be easier without macros Differential Revision: D54001770 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120314 Approved by: https://github.com/zhxchen17	2024-03-06 22:06:05 +00:00
Tobias Ringwald	76f3663efe	Fixed a memory leak when calling from_numpy on a numpy array with an … (#121156 ) …unsupported dtype. Fixes #121138. The lambda function that DECREFs the object is not called when the dtype conversion functions throws. This PR moves the conversion before the INCREF, which prevents the memory leak. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121156 Approved by: https://github.com/soulitzer, https://github.com/albanD	2024-03-06 19:37:38 +00:00
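The ordering fix described in the commit above can be illustrated with a small Python model. This is a sketch, not the C++ change itself: `FakeArray`, its `refcount` field, the `SUPPORTED` set, and the dtype names are hypothetical stand-ins for the NumPy object, its CPython reference count, and the dtype conversion that may throw.

```python
class FakeArray:
    """Hypothetical stand-in for a NumPy array; refcount mimics the CPython count."""

    def __init__(self, dtype):
        self.dtype = dtype
        self.refcount = 1


SUPPORTED = {"float32", "int64"}  # illustrative subset, not the real list


def convert_dtype(dtype):
    """Raise for dtypes outside the illustrative supported set."""
    if dtype not in SUPPORTED:
        raise TypeError(f"unsupported dtype: {dtype}")
    return dtype


def wrap_leaky(array):
    array.refcount += 1                 # stands in for Py_INCREF
    dtype = convert_dtype(array.dtype)  # may raise: the deleter below is never created

    def deleter():
        array.refcount -= 1             # stands in for the DECREF lambda

    return dtype, deleter


def wrap_fixed(array):
    dtype = convert_dtype(array.dtype)  # may raise before any reference is taken
    array.refcount += 1

    def deleter():
        array.refcount -= 1

    return dtype, deleter


def refcount_after_failed_wrap(wrap):
    """Try to wrap an array with an unsupported dtype and report its refcount."""
    array = FakeArray("unsupported_dtype")
    try:
        wrap(array)
    except TypeError:
        pass
    return array.refcount
```

With the leaky ordering the count stays elevated after the exception, because nothing is left to release the extra reference; with the fixed ordering it is unchanged.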
Simon Fan	05c256849b	[compiled autograd] support custom ops backed by c++ autograd::Function (#120681 ) - Adds support for custom ops backed by c++ custom autograd functions, e.g. fbgemm - Include files more granularly to avoid namespace pollution and circular imports limitations: - requires user to audit their code and opt-in their custom autograd::Function via autograd::Function::is_traceable and maybe additional compiled_args + apply_with_saved implementation. this was the only way I can think of for soundness - will throw if we can't hash the saved_data i.e. for any non implemented type other than list and dict in at::IValue::hash `b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)` - can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function, yet that has an identical implementation, is called. this case seems extremely unlikely, and the only alternative to hash collision i can think of is compiling with reflection - tensors not saved via save_variables are not lifted, and are specialized on TensorImpl*'s hash (treated as a memory address). if needed, we can lift them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681 Approved by: https://github.com/jansel	2024-03-06 18:01:56 +00:00
PyTorch MergeBot	b529c19bdf	Revert "Batch Norm Consolidation (#116092 )" This reverts commit `5680f565d5`. Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))	2024-03-06 17:10:01 +00:00
Animesh Jain	e3bd6efe72	[dynamo][guards-cpp-refactor] Prevent duplication of leaf guards (#121164 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121164 Approved by: https://github.com/jansel ghstack dependencies: #121121, #121147, #121154	2024-03-06 08:36:45 +00:00
Animesh Jain	b6b2d5b00a	[dynamo][guards-cpp-refactor] Pass source name for debug ease (#121154 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121154 Approved by: https://github.com/jansel ghstack dependencies: #121121, #121147	2024-03-06 08:36:45 +00:00
Animesh Jain	52d89d8491	[dynamo][guards-cpp-refactor] Simplify DictGuardManager by removing KeyValueDictGuardManager (#121147 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121147 Approved by: https://github.com/jansel ghstack dependencies: #121121	2024-03-06 08:36:45 +00:00
Animesh Jain	af7f55ffc8	[dynamo][guards-cpp-refactor] Add argnames in pybind'ings (#121121 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121121 Approved by: https://github.com/jansel	2024-03-06 08:36:45 +00:00
PyTorch MergeBot	8087912622	Revert "[XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185 )" This reverts commit `0ab2ec3738`. Reverted https://github.com/pytorch/pytorch/pull/120185 on behalf of https://github.com/briancoutinho due to This PR contains a list search in '_parse_kineto_events()' that can lead to very high cost of running this post trace, training jobs getting stuck for mins ([comment](https://github.com/pytorch/pytorch/pull/120185#issuecomment-1980180774))	2024-03-06 06:39:51 +00:00
Sheng Fu	31bfa59970	Capture primitive data type arguments for profiling python_function (#120949 ) RECORD_FUNCTION in python_function only captures arguments that are Tensors. However, it is very common for users to pass non-tensor arguments to custom ops, for example, the sequence length in a GPT attention custom op. My previous PR tried to capture all non-tensor arguments, which turned out to be very expensive in some cases. This PR supports primitive arguments (or containers of them) in RECORD_FUNCTION. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120949 Approved by: https://github.com/soulitzer	2024-03-06 05:09:22 +00:00
Tugsbayasgalan Manlaibaatar	5680f565d5	Batch Norm Consolidation (#116092 ) Summary: This commit simplifies the existing decomposition hierarchy of batch norm ops by adding a single, backend agnostic op: `batch_norm_with_update`. The existing hierarchy looks like: ``` aten.batch_norm -> aten._batch_norm_impl_index -> [ aten.native_batch_norm -> aten._native_batch_norm_legit (export only) -> _batch_norm_legit_cpu/cuda (kernels, export only) -> _batch_norm_cpu/cuda (kernels) ] OR [ aten.cudnn_batch_norm ] OR [ aten.miopen_batch_norm ] ``` Aside from complexity, an important problem with the above decomposition hierarchy is cuda numerics in export flows. We observed significantly worse convergence when training a mobilenetv2-like model when using the `_batch_norm_cuda` kernel instead of the `cudnn_batch_norm` kernel. This means users who export their models on CPU first then move the models to cuda later may silently see worse accuracies even when cudnn is installed, because they are using the worse kernel. This issue is summarized in https://github.com/pytorch/pytorch/issues/111384. Instead, the new hierarchy proposed by consolidating existing batch norm ops will look like: ``` aten.batch_norm -> aten.batch_norm_with_update -> [ _batch_norm_cpu (kernel) ] OR [ _batch_norm_cuda (kernel) ] OR [ cudnn_batch_norm (kernel) ] OR [ miopen_batch_norm (kernel) ] ``` The new op `batch_norm_with_update` hides backend implementation details and automatically picks the right kernel based on what is installed. This commit also adds the following variants to this op: ``` batch_norm_with_update_functional batch_norm_with_update.out batch_norm_no_update batch_norm_no_update.out batch_norm_backward ``` Note that this commit only adds this op and its variants, but does not actually change the decomps to produce these ops in the graph. This will be done after the 2 week FC window, and the ops used in the old stack is planned to be removed after the 6 month BC window. 
Test Plan: `OpInfo` tests for `batch_norm_with_update`. Reviewers: albanD, bdhirsh Subscribers: albanD, bdhirsh, supriyar Tasks: https://github.com/pytorch/pytorch/issues/111384 Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092 Approved by: https://github.com/bdhirsh, https://github.com/albanD	2024-03-06 04:50:46 +00:00
Valentin Andrei	8bb3e0b643	[pytorch] Name the main and autograd threads for better debugging (#121170 ) The main thread and the autograd one are latency critical threads. They launch CPU/GPU/Accelerator kernels and if for some reason they get preempted, the rank can become a straggler in a distributed training application. By naming these threads we can debug performance issues that impact the latency sensitive threads. I used Kineto traces to verify if the thread names were propagated: <img width="851" alt="Screenshot 2024-03-04 at 3 07 43 PM" src="https://github.com/pytorch/pytorch/assets/23515689/68b4a09c-b8e5-4f14-a5c0-6593f866c03f"> Also: ``` nvidia-smi +-----------------------------------------------------------------------------+ \| Processes: \| \| GPU GI CI PID Type Process name GPU Memory \| \| ID ID Usage \| \|=============================================================================\| \| 0 N/A N/A 3065920 C ...me#python#py_version_3_10 1968MiB \| \| 1 N/A N/A 3065926 C ...me#python#py_version_3_10 1978MiB \| \| 2 N/A N/A 3065930 C ...me#python#py_version_3_10 2084MiB \| \| 3 N/A N/A 3065936 C ...me#python#py_version_3_10 2016MiB \| \| 4 N/A N/A 3065939 C ...me#python#py_version_3_10 1998MiB \| \| 5 N/A N/A 3065943 C ...me#python#py_version_3_10 2070MiB \| \| 6 N/A N/A 3065948 C ...me#python#py_version_3_10 2026MiB \| \| 7 N/A N/A 3065952 C ...me#python#py_version_3_10 2070MiB \| +-----------------------------------------------------------------------------+ [me@myhost ~]$ ps -T -p 3065920 PID SPID TTY TIME CMD 3065920 3065920 pts/14 00:01:04 pt_main_thread ... 3065920 3092181 pts/14 00:00:40 pt_autograd_d0 3065920 3092182 pts/14 00:00:00 pt_autograd_d1 3065920 3092183 pts/14 00:00:00 pt_autograd_d2 3065920 3092184 pts/14 00:00:00 pt_autograd_d3 3065920 3092185 pts/14 00:00:00 pt_autograd_d4 3065920 3092186 pts/14 00:00:00 pt_autograd_d5 3065920 3092187 pts/14 00:00:00 pt_autograd_d6 3065920 3092188 pts/14 00:00:00 pt_autograd_d7 ... 
``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121170 Approved by: https://github.com/albanD	2024-03-05 22:15:39 +00:00
Peter Bell	34a28f01dd	[Autograd] Improve error for leaf tensors as out argument to fallback (#121089 ) Closes #120988 Currently operators that hit the autograd fallback call `check_inplace` on all mutated inputs, including out arguments. This leads to a slightly confusing error message: ``` RuntimeError: a leaf Variable that requires grad is being used in an in-place operation. ``` Compared to functions that don't fallback, which raise ``` RuntimeError: add(): functions with out=... arguments don't support automatic differentiation, but one of the arguments requires grad. ``` This changes the error message to make clear the issue is with the out argument, but does not tighten the check to outright ban out arguments that require grad. Instead, I use the same checks from `check_inplace` which allows non-leaf tensors that require grad to pass without error. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121089 Approved by: https://github.com/lezcano, https://github.com/soulitzer ghstack dependencies: #121142	2024-03-05 21:13:27 +00:00
cyy	6ecd65886a	Remove unnecessary const_casts (#121225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121225 Approved by: https://github.com/soulitzer	2024-03-05 17:34:24 +00:00
cyy	507611f9ae	[CUDACachingAllocator] Turn Allocator::allocate into non-const (#120969 ) Ideally, the method should be non-const since it changes the allocator state. Some const_casts are also removed along the way. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120969 Approved by: https://github.com/albanD	2024-03-05 09:53:05 +00:00
Bin Bao	bd19d6d822	[AOTI] Use torchgen to generate C shim functions (#120513 ) Summary: The current C shim layer manually implements a C interface for a handful of ops. Obviously that's not scalable if we want to extend it to cover all aten ops. This new torchgen script automatically generates C shim interfaces for CPU and CUDA backends. The interface follows the same parameter passing rules as the current C shim layer, such as * Use plain C data types to pass parameters * Use AtenTensorHandle to pass at::Tensor * Use pointer type to pass optional parameter * Use pointer+length to pass list * Use device_type+device_index to pass device * When a parameter is a pointer of pointer, e.g. AtenTensorHandle**, the script generates either a list of optional values or an optional list of values https://gist.github.com/desertfire/83701532b126c6d34dae6ba68a1b074a is an example of the generated torch/csrc/inductor/aoti_torch/generated/c_shim_cuda.cpp file. The current version doesn't generate C shim wrappers for all aten ops, and probably generates more wrappers than needed on the other hand, but it should serve as a good basis. This PR by itself won't change AOTI codegen and thus won't introduce any FC breakage. The actual wrapper codegen changes will come in another PR with some version control flag to avoid FC breakage. Differential Revision: [D54258087](https://our.internmc.facebook.com/intern/diff/D54258087) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120513 Approved by: https://github.com/jansel	2024-03-05 04:28:44 +00:00
Francesco Fusco	26431db939	[ONNX] Perform implicit casting of constants for the onnx::where operator (#118733 ) (#120619 ) This PR fixes the problem of having the `Where` operator bound to different types in cases where the dtype is not explicitly set. The PR extends the implicit casting to the onnx::Where operator to fix the issue, and includes the corresponding unit test. Fixes #118733 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120619 Approved by: https://github.com/BowenBao, https://github.com/thiagocrepaldi	2024-03-04 19:27:30 +00:00
rzou	3ef0befdc9	Better error messages for impl_abstract_pystub (#120959 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120959 Approved by: https://github.com/drisspg	2024-03-04 15:24:36 +00:00
Animesh Jain	7f81563e5e	[dynamo][guards-cpp-refactor] Skip type and length check guard for DictGuardManager (#120739 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120739 Approved by: https://github.com/jansel ghstack dependencies: #120673	2024-03-02 13:15:53 +00:00
Animesh Jain	82d1465d8d	[dynamo][guards-cpp-refactor] DICT_CONTAINS guard (#120673 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120673 Approved by: https://github.com/jansel	2024-03-02 13:15:53 +00:00
Shuqiang Zhang	c8e56b4965	[c10d] dump from one and only one thread (PG0's monitor thread) (#120893 ) Summary: When there are multiple PGs in a process and a hardware failure happens, we found that multiple PGs/threads in the same process compete to dump the same records at the same time. This affects the reliability of dumps. In this PR, we make the change such that only one thread/PG can dump: PG0's monitor thread. We use a static variable to indicate that something (e.g., a collective timeout) has triggered the dump locally. The monitor thread dumps debug info under any one of the 3 conditions: 1: this static variable is set to true by the watchdog thread when it detects a timeout or pipe dump signal 2: a timeout signal is received from other ranks through tcpstore 3: no heartbeat of watchdog Test Plan: python test/distributed/test_c10d_nccl.py -k test_timeout_dumps_on_stuck_ranks Pull Request resolved: https://github.com/pytorch/pytorch/pull/120893 Approved by: https://github.com/wconstab	2024-03-02 00:13:13 +00:00
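The "one and only one dumper" idea above can be sketched with Python threads. This models only condition 1 (a flag set by a watchdog); the tcpstore signal and heartbeat conditions are omitted, and all names (`run_demo`, `watchdog`, `monitor`, `dump_triggered`) are hypothetical and unrelated to the c10d implementation.

```python
import threading


def run_demo(num_pgs=3):
    """Start one monitor and one watchdog per PG; return the PG ids that dumped."""
    dump_triggered = threading.Event()  # stands in for the process-wide static flag
    dumps = []
    lock = threading.Lock()

    def watchdog(pg_id):
        # Pretend every watchdog detects a timeout and raises the shared flag.
        dump_triggered.set()

    def monitor(pg_id):
        # Only PG 0's monitor thread is allowed to dump.
        if pg_id != 0:
            return
        if dump_triggered.wait(timeout=5.0):
            with lock:
                dumps.append(pg_id)

    threads = [threading.Thread(target=monitor, args=(i,)) for i in range(num_pgs)]
    threads += [threading.Thread(target=watchdog, args=(i,)) for i in range(num_pgs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dumps
```

Although every watchdog raises the flag, `run_demo()` records a single dump from PG 0, which is the property the commit is after.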
Will Constable	581fe26792	[C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745 Approved by: https://github.com/zdevito	2024-03-01 23:45:43 +00:00
albanD	8cb4855d1e	Release the GIL in serialization when it is safe to do so (#120818 ) In particular this ensures we release the GIL when serializing: - PyBytes objects (this is how we get the pickle object) - Storage objects Other string-like objects keep the GIL, which is fine because we only use this for very small strings today (for endianness), so releasing the GIL is not important there. Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120818 Approved by: https://github.com/colesbury	2024-03-01 22:37:26 +00:00
Simon Fan	82b356193d	Move VariableInfo into its own file to avoid circular dependency (#120732 ) VariableInfo is used by both `custom_function.h` (in a templated class) and `compiled_autograd.h` (in a class with some templated methods). Another way could have been to make a `compiled_autograd.cpp` and forward declare VariableInfo, but this VariableInfo was also being used in other nodes like PyNode so it felt cleaner to do it this way. Differential Revision: [D54287007](https://our.internmc.facebook.com/intern/diff/D54287007) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120732 Approved by: https://github.com/jansel	2024-03-01 08:48:13 +00:00
Ma Jian	518a23bb03	support bool as Scalar Type in TorchScript (#113835 ) Fixes #112402 Fixes #75465 From the description in #75465, the bool type should be a subtype of int, and `register_prim_ops.cpp` already supports converting from bool to int or float. So this patch can fix bool as Scalar in TorchScript. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113835 Approved by: https://github.com/davidberard98	2024-03-01 04:20:15 +00:00
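The subtyping relationship mentioned above already holds in plain Python, which may help explain the expectation. The sketch below shows Python semantics only, not TorchScript's type system; `as_scalar` is a hypothetical helper.

```python
def as_scalar(value):
    """Hypothetical helper: accept any Python int or float, bool included."""
    if not isinstance(value, (int, float)):
        raise TypeError(f"expected a number, got {type(value).__name__}")
    return value


# bool is a subclass of int in Python, so a bool passes an int check
# and converts to int or float without an explicit special case.
bool_is_int_subtype = issubclass(bool, int)
true_plus_one = as_scalar(True) + 1
false_as_float = float(as_scalar(False))
```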
PyTorch MergeBot	76d3a6bb4a	Revert "[C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745 )" This reverts commit `381a7ad3f1`. Reverted https://github.com/pytorch/pytorch/pull/120745 on behalf of https://github.com/kit1980 due to The new test fails internally, see D54343421 ([comment](https://github.com/pytorch/pytorch/pull/120745#issuecomment-1972047106))	2024-02-29 22:06:13 +00:00
Edward Z. Yang	0a7666801d	SymIntify prod_backward (#120776 ) Fixes https://github.com/pytorch/pytorch/issues/120608 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120776 Approved by: https://github.com/albanD	2024-02-29 20:05:22 +00:00
Shuqiang Zhang	313abcdba2	[c10d] fix the unwanted reason (#120863 ) Summary: Addressing #120849. Current c10d treats a reason as a failure, and hence gives some unwanted false-positive errors. This is a quick fix, but we need to revisit the error handling logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120863 Approved by: https://github.com/kwen2501	2024-02-29 19:58:11 +00:00
Bin Bao	52e3c78a43	[AOTI][refactor] Move a few util functions in atoi_torch (#119987 ) Summary: Move these util functions from an anonymous namespace to a common header so that later torchgen-ed files can use them. Differential Revision: [D54258088](https://our.internmc.facebook.com/intern/diff/D54258088) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119987 Approved by: https://github.com/chenyang78	2024-02-29 15:46:47 +00:00
Shengbao Zheng	5b9e5f854b	[profiler] Log process group id instead of backend id (#120475 ) Summary: https://github.com/pytorch/pytorch/pull/104373 introduced backend_id, a unique ID for the actual backend object that is also exposed in record_param_comms, so that collectives can be correlated with the right backend object. However, it is inconvenient to correlate collectives with the backend id. Instead, using the pg id (uid) to correlate directly is a better solution. This PR changes the ID information exposed in record_param_comms from backend_id to pg_id. Differential Revision: D53558257 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120475 Approved by: https://github.com/aaronenyeshi	2024-02-29 15:04:33 +00:00
Yifu Wang	f988f649be	[IntraNodeComm] accept P2P buffer size as constructor argument (#120856 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120856 Approved by: https://github.com/wanchaol ghstack dependencies: #120855	2024-02-29 11:43:52 +00:00
Yifu Wang	22b5548f5d	[IntraNodeComm] refactor all_reduce variants as private methods (#120855 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120855 Approved by: https://github.com/Chillee, https://github.com/wanchaol	2024-02-29 11:43:52 +00:00
Sergii Dymchenko	09aefe1502	Fix ouput typos (#120870 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120870 Approved by: https://github.com/clee2000	2024-02-29 08:29:14 +00:00
Animesh Jain	82cbd9b131	[dynamo][guards-cpp-refactor] PythonLambdaGuardAccessor (#120730 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120730 Approved by: https://github.com/jansel ghstack dependencies: #120864	2024-02-29 07:25:13 +00:00
Will Constable	381a7ad3f1	[C10D] Add ProcessGroup op_id to track ops inside coalescing region (#120745 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120745 Approved by: https://github.com/zdevito ghstack dependencies: #120724, #120270	2024-02-29 01:03:31 +00:00
Will Constable	f85d3a022c	[C10D] Fix pointToPoint op Flight Recording (#120270 ) Fix and test issues with both coalesced and individual send/recv ops Considered an alternate approach and then ditched it - alternate approach: #119757 - reason ditched: prefer recording individual collective events inside coalescing region instead of just the event at the end of the region, which also would not have tensor sizes or opnames without additional state variables added Another approach also ditched - record events on workEnqueue instead of initWork - reason ditched: too messy to get input/output shapes tagged on recording when recording in workEnqueue. Adding the info onto the Work obj would be possible, but adds to overhead of copying Works which we do on every collective. We can get info off the input/output tensors directly in initWork, but we don't want to keep refs to those tensors alive while the work is Enqueued, so we'd have to specifically copy size lists or something. This PR instead avoids creating a work inside pointToPoint when coalescing is active. Instead, only at endCoalescing() is a work finally initialized and enqueued. But it adds a record() call inside pointToPoint() instead of creating a work, during coalescing. This record() call picks up tensor shapes and op names. It ALSO changes initWork to accept a 'record' argument. This defaults to false, and should only be set to true if the caller ensures the work will be enqueued by workEnqueue, ensuring its cuda events are live when used by flight recorder's update_state(). The testing uncovers some odd pre-existing behavior and leaves them alone for now. We could change some of these - seq starts off at 1, not 0 for first op (but this is inconsistent) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120270 Approved by: https://github.com/shuqiangzhang ghstack dependencies: #120724	2024-02-29 01:03:31 +00:00
Will Constable	7f4d673885	[C10D] Add record_id to flight recorder (#120724 ) In cases where sequence number is shared between events (e.g. coalesced collectives) we want to ensure a unique (and ordered) ID per record. Note: the records are already in a list, so their ID could be implicitly observed. But (1) it's a ring buffer, so absolute ID is lost once the buffer rolls over once, (2) users may sort or process or filter their flight records, so having the ID be an explicit member of an entry is still useful Pull Request resolved: https://github.com/pytorch/pytorch/pull/120724 Approved by: https://github.com/zdevito	2024-02-29 01:03:31 +00:00
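The motivation above (a ring buffer loses absolute positions once it rolls over, and coalesced ops share a sequence number) can be sketched in plain Python. This is an illustration under assumptions, not the c10d flight recorder: the `FlightRecorder` class, its methods, and the entry fields other than `record_id` are hypothetical.

```python
from collections import deque
from itertools import count


class FlightRecorder:
    """Fixed-capacity ring buffer whose entries carry an explicit, ordered ID."""

    def __init__(self, capacity):
        self._entries = deque(maxlen=capacity)  # oldest entries are evicted first
        self._next_id = count()                 # monotonically increasing counter

    def record(self, seq, op):
        entry = {"record_id": next(self._next_id), "seq": seq, "op": op}
        self._entries.append(entry)
        return entry

    def dump(self):
        return list(self._entries)


recorder = FlightRecorder(capacity=3)
# Coalesced point-to-point ops share one sequence number.
for op_name in ["send", "recv", "send", "recv"]:
    recorder.record(seq=1, op=op_name)
recorder.record(seq=2, op="all_reduce")

# After rollover only the last three entries remain, but their IDs still
# reveal their absolute position in the full history.
surviving_ids = [entry["record_id"] for entry in recorder.dump()]
surviving_seqs = [entry["seq"] for entry in recorder.dump()]
```

Because the ID is stored on each entry, it also survives any sorting or filtering a user applies to the dumped records.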
PyTorch MergeBot	4903e33e19	Revert "Capture non tensor arguments in record_function (#120017 )" This reverts commit `5c5b71b6ee`. Reverted https://github.com/pytorch/pytorch/pull/120017 on behalf of https://github.com/soulitzer due to regresses perf on autograd Function when using profiler ([comment](https://github.com/pytorch/pytorch/pull/120017#issuecomment-1969883792))	2024-02-28 20:43:33 +00:00
Jason Ansel	01ec8df6d8	[Compiled Autograd] Introduce BackwardState capture (#120382 ) This adds support for backwards hooks that are both: 1) Interior to the graph; and 2) Dynamically generated (e.g. lambdas) We do this by creating a BackwardState object that is used to register the hooks in the forward, then populated by dynamo after the forwards runs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120382 Approved by: https://github.com/xmfan	2024-02-28 20:36:47 +00:00
Shengbao Zheng	11de40f82f	[flight recorder] record process group configuration (#120262 ) Summary: Record process group configuration (i.e. ranks involved in a process group) to facilitate NCCL related debugging. Differential Revision: D53792087 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120262 Approved by: https://github.com/shuqiangzhang	2024-02-28 20:31:08 +00:00
PyTorch MergeBot	a9d9077f12	Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639 )" This reverts commit `7c556428c7`. Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/kit1980 due to breaking internal builds, see D54286923 ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1969634480))	2024-02-28 18:57:09 +00:00
Rohan Potdar	f67c77c497	Update engine.cpp (#120773 ) Minor comment fix; `backward` and `grad` are flipped here. See https://pytorch.org/docs/stable/_modules/torch/autograd.html#backward Pull Request resolved: https://github.com/pytorch/pytorch/pull/120773 Approved by: https://github.com/albanD, https://github.com/janeyx99, https://github.com/soulitzer	2024-02-28 18:23:35 +00:00
Xunsong, Huang	0ab2ec3738	[XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185 ) This pull request is writing to provide an update on the recent advancements made in the PyTorch profiler with regards to XPU backend support. Following the successful merge of a previous pull request #94502 that established a pathway for the XPU backend within PyTorch, we have now taken steps to enhance the profiler's capabilities for handling and displaying profile data directly related to the XPU backend. # Motivation The current pull request builds upon this foundation by refining the profiler's data processing scripts, particularly `profiler_util.py`, to accommodate XPU backend-specific profile data. The aim is to align the handling and presentation of this data with that of the CUDA backend, offering users a consistent experience across different device profiles. This includes generating outputs such as JSON files compatible with Chrome trace tooling, among other formats. # Principles 1. Minimal Impact: The modifications introduced should support XPU backend data with minimal disruption to the existing profiling scripts. 2. Consistency: Changes should maintain stylistic and functional consistency with existing `CUDA` and `privateuse1` pathways, ensuring no adverse effects on other logic paths. 3. Exclusivity: Ensure that the new XPU pathway does not interfere with or impede other pathways. # Solutions ### a. Pathway Identification: Introduction of a `use_xpu` flag within `torch.autograd.profiler.profile` interfaces to distinguish XPU-specific profiling. ### b. `use_device` Logic Revision: With the introduction of the XPU pathway, `use_device` no longer implies a binary relationship with `use_cuda`. Consequently, we have revised related logic to remove implicit assertions and establish independent device distinction. ### c. 
Kernel List Segregation: To accommodate the non-binary nature of device pathways, we have enabled kernel lists to identify specific device affiliations through separate list objects. ### d. Formatted Output: To ensure output consistency, we have employed code duplication and keyword substitution techniques to facilitate the formatting of XPU-related profile data. # Additional Enhancements ### a. Enumerations in `.pyi` Files: Added recognition items for `DeviceType` and `ProfilerActivity` specific to XPU. ### b. Correct DeviceType Returns: Revised `deviceTypeFromActivity` logic to accurately differentiate between device backends, even when they share common flags such as `libkineto::ActivityType::GPU_MEMCPY`. ### c. Bug Fixes in `cuda_corr_map`: Addressed a corner case where erroneous parent-child event relationships were formed due to shared function event identifiers. The solution involves refining `cuda_corr_map` processing to prevent a function event from being misidentified as both the linker and linkee. # Further Abstraction Looking forward, we acknowledge the potential for further abstraction in the codebase. The current changes necessitated by XPU support have highlighted opportunities for reducing redundancy by consolidating naming conventions and utilizing a singular `device` naming system that relies on `DeviceType` attributes or string flags for differentiation. This would involve significant refactoring to replace device-specific flags and variables. This topic needs further discussions about whether we could and when we should deprecate all those flags and variables named with `cuda`. # Next Pull Request The next pull request will be contingent on Kineto's adoption of Intel's forthcoming PTI-sdk library, which will enable direct usage of XPU-related tracers. Subsequent modifications to `libkineto_init()` will aim to endow PyTorch running on XPU backends with comprehensive profiling capabilities on XPU devices. 
We appreciate your attention to these enhancements and welcome any feedback or questions you may have regarding these developments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120185 Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui	2024-02-28 17:50:32 +00:00
Chao Zhou	a11a49af58	Add NCCL work sequence number to work info (#120596 ) Summary: Expose sequence number to work info. The number can help applications identify a NCCL work more precisely. Test Plan: 1. pytest test/distributed/test_c10d_nccl.py::WorkHookTest::test_on_completion_hook_seq 2. pytest test/distributed/test_c10d_nccl.py::WorkHookTest Differential Revision: D54180050 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120596 Approved by: https://github.com/kwen2501	2024-02-28 07:54:37 +00:00
Yu, Guangye	12995a5d9d	[2/2] Intel GPU Runtime Upstreaming for Generator (#118613 ) # Motivation According to [[1/2] Intel GPU Runtime Upstreaming for Generator](https://github.com/pytorch/pytorch/pull/118528), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`. # Design Currently, it primarily offers generator-related APIs, including - `torch.xpu.default_generators` - `torch.xpu.get_rng_state` - `torch.xpu.get_rng_state_all` - `torch.xpu.initial_seed` - `torch.xpu.manual_seed` - `torch.xpu.manual_seed_all` - `torch.xpu.seed` - `torch.xpu.seed_all` - `torch.xpu.set_rng_state` - `torch.xpu.set_rng_state_all` # Additional Context The differences with CUDA: The generator-related frontend python APIs have a 1:1 mapping with CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118613 Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD	2024-02-28 05:28:11 +00:00
Yang Chen	1627d9e06d	[aot_inductor] added a utility function aoti_torch_print_tensor_handle (#120660 ) Added a function to print tensor values for a tensor handle. It can be injected into the cpp wrapper code to help debug numerical issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120660 Approved by: https://github.com/desertfire	2024-02-28 02:08:34 +00:00
Yu, Guangye	1aa9099839	[CLANGTIDY] Enable clang-tidy in torch/csrc/xpu (#120616 ) # Motivation refer to [#118504](https://github.com/pytorch/pytorch/pull/118504), enabling clang-tidy in `torch/csrc/xpu`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120616 Approved by: https://github.com/albanD	2024-02-28 01:35:25 +00:00
Tobias Ringwald	7c556428c7	Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639 ) Fixes #115331. This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary: - `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`. - Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`. - Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this. - Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS` [^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639 Approved by: https://github.com/cyyever, https://github.com/albanD, https://github.com/huydhn	2024-02-27 07:05:48 +00:00
Levy Zhao	b6139b1e57	[PyTorch][CUDA Caching Allocator] Export sync-stream-and-free-HBM counter in memory_stats for performance debugging (#120050 ) Differential Revision: D53734057 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120050 Approved by: https://github.com/xw285cornell	2024-02-27 04:34:53 +00:00
Animesh Jain	63f874b476	[dynamo][guards-cpp-refactor] DictGetItemGuardAccessor for f_locals (#120593 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120593 Approved by: https://github.com/jansel	2024-02-27 03:13:55 +00:00
William Wen	ecb3f33a1a	[dynamo] fix segfault in _debug_get_cache_entry_list (#120635 ) Fix https://github.com/pytorch/pytorch/issues/120607. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120635 Approved by: https://github.com/jansel	2024-02-26 23:31:09 +00:00
Animesh Jain	a299db2983	[dynamo][guards-cpp-refactor] NO_HASATTR guard (#120469 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120469 Approved by: https://github.com/jansel	2024-02-26 04:37:40 +00:00
Shan19900305	685d862c45	Add SparsePrivateUse1 in backend_to_string, layout_from_backend and check_base_legacy_new. (#119263 ) 1) Using items stored in torch._tensor_classes to check item passed from python side; 2) Add SparsePrivateUse1 in backend_to_string, layout_from_backend and check_base_legacy_new; 3) Using more general API to get python module name in get_storage_obj and get_name functions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119263 Approved by: https://github.com/ezyang	2024-02-26 01:54:30 +00:00
Animesh Jain	4328e772bf	[dynamo][guards-cpp-refactor] DICT_VERSION guard (#120416 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120416 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342, #120344, #120359	2024-02-25 23:24:24 +00:00
Animesh Jain	c269e48af0	[dynamo][guards-cpp-refactor] DictGuardManager (#120359 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120359 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342, #120344	2024-02-25 23:24:24 +00:00
Animesh Jain	775a4388d9	[dynamo][guards-cpp-refactor] WEAKREF_ALIVE guard (#120344 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120344 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096, #120342	2024-02-25 23:24:04 +00:00
cyy	81f0b2c14e	[Clang-tidy header][19/N] Enable clang-tidy on torch/csrc/autograd/profiler_legacy.* (#120552 ) This PR enables clang-tidy on torch/csrc/autograd/profiler_legacy.* and cleans some path rules of clang-tidy. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120552 Approved by: https://github.com/Skylion007	2024-02-25 03:29:40 +00:00
Shuqiang Zhang	8e20385447	[c10d] fix the macro definition of NCCL_COMM_DUMP (#120502 ) Summary: Only if both macros are defined, should we dump the comm dump, otherwise, use the original definition. The previous implementation missed the function definition when IS_NCCL_EXP is defined but NCCL_COMM_DUMP is not defined Test Plan: Build and unit test Pull Request resolved: https://github.com/pytorch/pytorch/pull/120502 Approved by: https://github.com/dsjohns2, https://github.com/Skylion007	2024-02-23 20:59:39 +00:00
Animesh Jain	007606e520	[dynamo][guards-cpp-refactor] TENSOR_MATCH guard (#120342 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120342 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093, #120096	2024-02-23 20:10:09 +00:00
Animesh Jain	4b65d192f0	[dynamo][guards-cpp-refactor] DYNAMIC_INDICES guard (#120096 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120096 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123, #120093	2024-02-23 20:10:09 +00:00
Animesh Jain	a92ce46dc3	[dynamo][guards-cpp-refactor] GlobalWeakRefGuardAccessor (#120093 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120093 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119, #120123	2024-02-23 20:10:01 +00:00
Animesh Jain	bb331b1eb5	[dynamo][guards-cpp-refactor] LENGTH_CHECK guard (#120123 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120123 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091, #120119	2024-02-23 20:09:52 +00:00
Animesh Jain	2eac593ffd	[dynamo][guards-cpp-refactor] TUPLE_ITERATOR_LEN guard (#120119 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120119 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089, #120091	2024-02-23 20:09:43 +00:00
Animesh Jain	da95421f05	[dynamo][guards-cpp-refactor] TupleIteratorGetItemAccessor (#120091 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120091 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068, #120089	2024-02-23 20:09:34 +00:00
Shuqiang Zhang	39f0a5ecc9	[c10d] simplify the dump timeout logic and unify the async call (#120331 ) Summary: The current dump timeout logic is a bit cumbersome as it needs 2 times: 1. timeout, 2. wake up time. And in theory the caller just needs to wait for a max of timeout value for the dump and declare the dump to be either successful or not. Also we unify the async call using std::async instead of a customized async launch function for each operation. Test Plan: Unit tests Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/120331 Approved by: https://github.com/wconstab	2024-02-23 19:46:40 +00:00
cyy	97918e8c37	[Clang-tidy header][18/N] Enable clang-tidy on headers in torch/csrc/cuda (#118504 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118504 Approved by: https://github.com/albanD	2024-02-23 16:47:33 +00:00
Shuqiang Zhang	2b0168aeb0	[c10d] update the work progress of PG periodically (#120438 ) Summary: Previously, I added lastEnqueuedSeq_ and lastCompletedSeq_ to store the states of PG progress but log only when there is timeout detected. We found it is not enough since the 'straggler' itself might not detect the timeout and hence there is no log from the 'straggler'. In this PR, we log these states periodically so that it would be much easier for us to identify the straggler by checking which rank has the smallest number of lastEnqueuedSeq_ Test Plan: Log adding, build success Pull Request resolved: https://github.com/pytorch/pytorch/pull/120438 Approved by: https://github.com/wconstab, https://github.com/XilunWu, https://github.com/kwen2501	2024-02-23 01:40:43 +00:00
Yifu Wang	1c9fc720ae	Change the .clone() in native funcol's all_reduce to use at::MemoryFormat::Contiguous (#120042 ) Summary: While I think it probably makes more sense to only require `all_reduce` input to be non-overlapping and dense, today `ProcessGroupNCCL` requires it to be contiguous. This is also what the `all_reduce` in non-native funcol does. Also marking a test affected by this with `@run_with_both_funcol_impls`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120042 Approved by: https://github.com/wanchaol	2024-02-22 20:24:15 +00:00
briancoutinho	b88621040a	[profiler] Add kineto init delay when used in daemon mode (#120276 ) Fixes #112389 ## About PyTorch (Kineto) profiler registers with the profiling daemon Dynolog to enable on-demand profiling. The user should only need to set the env variable `KINETO_USE_DAEMON`. To enable this we need to initialize kineto library early rather than lazily on a PyTorch profiler call. This initialization happens in a static initializer. - Kineto init function basically registers a callback using the CUDA CUPTI library https://github.com/pytorch/kineto/blob/main/libkineto/src/init.cpp#L130-L148 - However, the above needs the dynamic linking to libcupti.so to have taken place. - I understand now that static initializations of compilation units will be called before the dynamic linking leading to a segfault in #112389 ![image](https://github.com/pytorch/pytorch/assets/6922212/29c9e79b-8080-4198-aaae-8a5696dccaec) ## Workaround We add a delay in the initialization that can be configured using the env variable 'KINETO_DAEMON_INIT_DELAY_S'. May not be the best but it could help resolve the issue. 
## Testing Tested this out with [linear_model_example.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py) First export the daemon env variable ### Without any delay ``` >$ python3 linear_model_example.py INFO:2024-02-21 19:34:50 2366287:2366287 init.cpp:131] Registering daemon config loader, cpuOnly = 1 INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1 INFO:2024-02-21 19:34:50 2366287:2366287 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclientb8f91363-d8d6-47a7-9103-197661e28397 status = initialized INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1 INFO:2024-02-21 19:34:50 2366287:2366287 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1 cpu 99 1385.468505859375 ``` ### With 5 seconds delay ``` >$ KINETO_DAEMON_INIT_DELAY_S=5 python3 linear_model_example.py cpu 99 284.82305908203125 10099 8.817167282104492 INFO:2024-02-21 19:34:26 2359155:2359214 init.cpp:131] Registering daemon config loader, cpuOnly = 1 ERROR: External init callback must run in same thread as registerClient (1782580992 != -1922169024) INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1 INFO:2024-02-21 19:34:26 2359155:2359214 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient49270a3f-e913-4ea6-b9e0-cc90a853a869 status = initialized INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1 INFO:2024-02-21 19:34:26 2359155:2359214 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1 20099 8.817167282104492 ``` ### With an invalid delay ``` >$ KINETO_DAEMON_INIT_DELAY_S=abc python3 linear_model_example.py INFO:2024-02-21 19:35:02 2369647:2369647 init.cpp:131] Registering daemon config loader, cpuOnly = 1 INFO:2024-02-21 19:35:02 
2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1 INFO:2024-02-21 19:35:02 2369647:2369647 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0e12a349-af7b-4322-901d-1ff22f91fd4c status = initialized INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1 INFO:2024-02-21 19:35:02 2369647:2369647 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1 cpu ``` ### Unit test updated as well. ## Impact This should not impact any general user. The initialization only occurs if `KINETO_USE_DAEMON` is set in the environment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120276 Approved by: https://github.com/anupambhatnagar, https://github.com/aaronenyeshi	2024-02-22 18:17:33 +00:00
Sheng Fu	5c5b71b6ee	Capture non tensor arguments in record_function (#120017 ) Summary: RECORD_FUNCTION only captures an argument when it is a Tensor. However, it is very common for users to pass arguments of primitive data types (int, float, index, bool). This DIFF adds support for non tensor arguments in RECORD_FUNCTION. Test Plan: unit test buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 test_execution_trace_alone test_execution_trace_with_kineto test_execution_trace_start_stop test_execution_trace_repeat_in_loop test_execution_trace_no_capture Differential Revision: D53674768 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120017 Approved by: https://github.com/soulitzer	2024-02-22 09:40:08 +00:00
PyTorch MergeBot	fff9d98e58	Revert "Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639 )" This reverts commit `e0268821dd`. Reverted https://github.com/pytorch/pytorch/pull/119639 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think the Window failures are legit as they are failing now in trunk, i.e. `450339ab2d` ([comment](https://github.com/pytorch/pytorch/pull/119639#issuecomment-1958428416))	2024-02-22 00:12:54 +00:00
Tobias Ringwald	e0268821dd	Increased compile time max GPUs to 512. Switched to int16_t DeviceIndex. (#119639 ) Fixes #115331. This PR increases the number of valid GPU devices to 512 (from 64) in order to future-proof PyTorch for providers that offer [single nodes with a large device count](https://www.tensorwave.com/). Until now, `DeviceIndex` was an `int8_t`, thus multiple changes were necessary: - `DeviceIndex` changed to `int16_t`. Updated consumers that assume it to be an `int8_t`. - Updated bounds checking for `torch.device()` in the Python frontend. Right now, we allow funny things like `torch.device('cpu', 200).index == -56`, which is undefined behavior. I inserted some checks to only allow values between 0 and `c10::Device::MAX_NUM_DEVICES - 1`. - Updated the `ArgumentInfo` struct as it hardcodes the device index as 8 bit field [^1]. Might be a breaking change, not sure if users rely on this. - Introduced `c10::Device::MAX_NUM_DEVICES` as a replacement for the old `C10_COMPILE_TIME_MAX_GPUS` [^1]: This field was unsigned, so I guess this has also been undef behavior the whole time? Our default device index is -1, so this always wrapped around to 255 when written to the `ArgumentInfo` struct. When I switched the `DeviceIndex` to `int16_t`, it actually stayed 255 after unpacking from `ArgumentInfo` again, as the `DeviceIndex` was now wide enough that it didn't wrap back to -1. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119639 Approved by: https://github.com/cyyever, https://github.com/albanD	2024-02-21 21:10:49 +00:00
soulitzer	27c5bbe5cb	Add is_nested_int() (#119975 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119975 Approved by: https://github.com/jbschlosser ghstack dependencies: #119661, #119974	2024-02-21 21:10:02 +00:00
Animesh Jain	9c64068ef8	[dynamo][guards-cpp-refactor] TypeGuardAccessor (#120089 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120089 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067, #120068	2024-02-21 17:56:48 +00:00
Animesh Jain	ec6783990a	[dynamo][guards-cpp-refactor] GlobalsGuardAccessor (#120068 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120068 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065, #120067	2024-02-21 17:56:48 +00:00
Animesh Jain	66c52d678f	[dynamo][guards-cpp-refactor] GetItemGuardAccessor (#120067 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120067 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064, #120065	2024-02-21 17:56:36 +00:00
Animesh Jain	7a0c2a9d0a	[dynamo][guards-cpp-refactor] NO_TENSOR_ALIASING guard (#120065 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120065 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062, #120064	2024-02-21 17:56:18 +00:00
Animesh Jain	8d5ae8c0b3	[dynamo][guards-cpp-refactor] TENSOR_ALIASING guard (#120064 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120064 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061, #120062	2024-02-21 17:56:05 +00:00
Animesh Jain	034955b2fc	[dynamo][guards-cpp-refactor] DATA_PTR_MATCH guard (#120062 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120062 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060, #120061	2024-02-21 17:55:46 +00:00
Animesh Jain	cc6cf89c30	[dynamo][guards-cpp-refactor] GLOBAL_STATE guard (#120061 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120061 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833, #120060	2024-02-21 17:55:32 +00:00
Animesh Jain	5066bec743	[dynamo][guards-cpp-refactor] DEFAULT_DEVICE guard (#120060 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120060 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827, #119833	2024-02-21 17:55:17 +00:00
Shuqiang Zhang	a24cba35b0	[c10d][flight recorder] dump additinal NCCL debug info (#120063 ) Summary: This PR is mainly about flight recorder side of changes that takes a map of maps as input, and dump it as picklable. Also add functions that should be compiled only when NCCL_COMM_DUMP is defined Test Plan: Integration tests with NCCL would be done later, here we only do the c10d side of dump test, aka,NCCLTraceTest Testing the dump function is a bit tricky as we don't have existing C++ unit tests for them. So we still use the Python NCCLTraceTest with the python binding of _dump_nccl_trace(), we manually fed the dump_nccl_trace with a map of test info, and assert the pickle result and print the converted python dict: ``` (sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (main)]$ python test/distributed/test_c10d_nccl.py NCCLTraceTest NCCL version 2.19.3+cuda12.0 [rank0]:[E ProcessGroupNCCL.cpp:1200] [PG 0 Rank 0] ProcessGroupNCCL preparing to dump debug info. .NCCL version 2.19.3+cuda12.0 .NCCL version 2.19.3+cuda12.0 {'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2': 'Value2', 'Key1': 'Value1'}} {'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2': 'Value2', 'Key1': 'Value1'}} .NCCL version 2.19.3+cuda12.0 {'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2': 'Value2', 'Key1': 'Value1'}} {'ncclID2': {'Key2': 'Value2', 'Key1': 'Value1'}, 'ncclID1': {'Key2': 'Value2', 'Key1': 'Value1'}} .NCCL version 2.19.3+cuda12.0 .NCCL version 2.19.3+cuda12.0 .NCCL version 2.19.3+cuda12.0 .NCCL version 2.19.3+cuda12.0 . ---------------------------------------------------------------------- Ran 8 tests in 95.761s OK ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/120063 Approved by: https://github.com/wconstab	2024-02-21 16:35:23 +00:00
cyy	3cd6a21e8f	[DeviceIndex][6/N] Use DeviceIndex in more places (#120133 ) This PR follows the series of patches beginning with #119142 and fixes various XPU and python related methods to use DeviceIndex. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120133 Approved by: https://github.com/Skylion007	2024-02-21 06:24:23 +00:00
Yifu Wang	2d6c0cc81b	Run test_functional_api.py with both legacy and native funcol impls (#119982 ) Additional changes: tests in test_functional_api.py uses multi-threaded pg which is implemented in Python. For the native ops to call into the Python pg implementation, glue code in PyProcessGroup is required for each collective. This PR also adds a few pieces of previously missing glue code, which are necessary for running test_functional_api.py with native funcol. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119982 Approved by: https://github.com/wanchaol	2024-02-20 21:15:37 +00:00
Animesh Jain	389b56b4c4	[dynamo][guards-cpp-refactor] GetAttrGuardAccessor (#119833 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119833 Approved by: https://github.com/jansel ghstack dependencies: #119822, #119827	2024-02-20 05:33:08 +00:00
Animesh Jain	96f45d15d8	[dynamo][guards-c++-refactor] EQUALS_MATCH guard (#119827 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119827 Approved by: https://github.com/jansel ghstack dependencies: #119822	2024-02-20 05:33:08 +00:00
Animesh Jain	0802951081	[dynamo][guards-c++-refactor] Introduce LeafGuard, GuardManager and GuardAccessor classes (#119822 ) The full blown implementation is in this stack - https://github.com/pytorch/pytorch/pull/110590 which is passing all the test cases on CI. That stack is hard to review. So, breaking apart. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119822 Approved by: https://github.com/jansel	2024-02-20 05:33:08 +00:00
Yifu Wang	40786ca509	Handle unwaited work objects on process termination (#119881 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119881 Approved by: https://github.com/wconstab	2024-02-19 02:46:02 +00:00
cyy	a9953a5ef3	Remove unused c10/util/C++17.h inclusion and outdated checks (#120149 ) This is a continued work to clean up pre-C++17 code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120149 Approved by: https://github.com/ezyang	2024-02-17 14:28:17 +00:00
Shuqiang Zhang	30000aa3fd	[c10d] remove one line of verbose log (#120138 ) Summary: I don't find existing DBG mode support in c10d. This is flooding the log, removing it to unblock user Test Plan: Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/120138 Approved by: https://github.com/wconstab	2024-02-17 06:39:57 +00:00
Bin Bao	fa0e39560c	[AOTI] Fix a typo (#120094 ) Differential Revision: [D53861810](https://our.internmc.facebook.com/intern/diff/D53861810) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120094 Approved by: https://github.com/khabinov, https://github.com/sijiac	2024-02-17 05:28:58 +00:00
Aaron Enye Shi	7973ac586d	[Memory Snapshot] Add CUDAAllocatorConfig details into snapshot metadata (#119404 ) Summary: Include the CUDAAllocatorConfig at the time of snapshot into the snapshot file. These include adding variables: ``` double garbage_collection_threshold; size_t max_split_size; size_t pinned_num_register_threads; bool expandable_segments; bool release_lock_on_cudamalloc; bool pinned_use_cuda_host_register; std::string last_allocator_settings; std::vector<size_t> roundup_power2_divisions; ``` Test Plan: `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True ` produces ``` {'PYTORCH_CUDA_ALLOC_CONF': 'expandable_segments:True', 'max_split_size': -1, 'garbage_collection_threshold': 0.0, 'expandable_segments': True, 'pinned_num_register_threads': 1, 'release_lock_on_cudamalloc': False, 'pinned_use_cuda_host_register': False, 'roundup_power2_divisions': {'1': 0, '2': 0, '4': 0, '8': 0, '16': 0, '32': 0, '64': 0, '128': 0, '256': 0, '512': 0, '1024': 0, '2048': 0, '4096': 0, '8192': 0, '16384': 0, '32768': 0}} ``` `PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]"` produces ``` {'PYTORCH_CUDA_ALLOC_CONF': 'max_split_size_mb:2000,roundup_power2_divisions:[256:1,512:2,1024:4,>:8]', 'max_split_size': 2097152000, 'garbage_collection_threshold': 0.0, 'expandable_segments': False, 'pinned_num_register_threads': 1, 'release_lock_on_cudamalloc': False, 'pinned_use_cuda_host_register': False, 'roundup_power2_divisions': {'1': 1, '2': 1, '4': 1, '8': 1, '16': 1, '32': 1, '64': 1, '128': 1, '256': 1, '512': 2, '1024': 8, '2048': 8, '4096': 8, '8192': 8, '16384': 8, '32768': 8} } ``` Differential Revision: D53536199 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/119404 Approved by: https://github.com/zdevito	2024-02-17 01:16:37 +00:00
Yifu Wang	4ac857f94e	Support broadcast in native funcol (#119229 ) ### Summary @LucasLLC recently implemented `broadcast` in funcol. This is not yet available in the native funcol ops. This PR adds support for broadcast for native funcol. - Added `_c10d_functional::broadcast` and `_c10d_functional::broadcast_` - Integrated with python funcol broadcast and `AsyncCollectiveTensor` - Implemented Inductor lowering. Verified correctness and buffer reuse behavior - Validated dynamo traceability - Validated AOTInductor compile-ability Pull Request resolved: https://github.com/pytorch/pytorch/pull/119229 Approved by: https://github.com/wanchaol ghstack dependencies: #119104	2024-02-16 21:01:34 +00:00
soulitzer	312ce35c1f	Rename singleton int to nested int (#119661 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661 Approved by: https://github.com/ezyang	2024-02-16 19:21:17 +00:00
Dan Johnson	124c251510	Guarantee init cuda before attaching hooks (#120052 ) Summary: If cuda is not initialized before calling attachAllocatorTraceTracker, then the CudaCachingAllocator device_allocator is empty, which means that the registration hooks are not set up. As a result, a new segment_alloc will not be registered, causing an expensive dynamic registration each time the segment is used. The fix is to guarantee that cuda is initialized before attaching the hooks. If cuda is already initialized, then this lazyInitCUDA is a no-op. Test Plan: Testing this on fsdp+tp example model where cuda is not initialized before init_process_group. Job without the fix keeps dynamically registering: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-j544j0vn7zqh4c?job_attempt=0&version=0&env=PRODUCTION The following keeps looping: [0]:2024-02-14T10:48:18.873079 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: registered buffer 0x7f6ebe000000 len 608124000, state 1 [0]:2024-02-14T10:48:18.873087 twshared0039:4836:6232 [0] NCCL INFO *dynamicRegist = true [0]:2024-02-14T10:48:18.903234 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregister buffer 0x7f6ebe000000 len 608124000, state 1 [0]:2024-02-14T10:48:18.903240 twshared0039:4836:6232 [0] NCCL INFO CTRAN-MAPPER: deregiter buffer 0x7f6ebe000000 len 608124000 Job with the fix does not have this issue: https://www.internalfb.com/mlhub/pipelines/runs/mast/torchx-fsdp_2d_main-hzm5dwqncr7l7?version=0&env=PRODUCTION Reviewed By: minsii, kwen2501, xw285cornell Differential Revision: D53770989 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120052 Approved by: https://github.com/kwen2501	2024-02-16 17:36:53 +00:00
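The ordering problem described above can be sketched in isolation. This is a minimal illustration, not the PyTorch implementation: `CachingAllocatorSketch` and all of its members are hypothetical names, and a plain counter per device stands in for the registration hooks.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an allocator that keeps one slot per device.
// The slots only exist after initialization has run.
struct CachingAllocatorSketch {
  std::vector<int> hooks_per_device;  // empty until lazy_init() is called
  bool initialized = false;

  // Idempotent initialization: the second and later calls are no-ops.
  void lazy_init(std::size_t device_count) {
    if (initialized) {
      return;
    }
    hooks_per_device.assign(device_count, 0);
    initialized = true;
  }

  // Attaches a hook to every known device and reports how many devices got it.
  // If called before initialization, there is nothing to attach to.
  std::size_t attach_hook() {
    for (int& hook_count : hooks_per_device) {
      ++hook_count;
    }
    return hooks_per_device.size();
  }

  // The pattern from the fix: initialize first, then attach.
  std::size_t attach_hook_after_init(std::size_t device_count) {
    lazy_init(device_count);
    return attach_hook();
  }
};
```

Calling `attach_hook()` on a fresh object reaches zero devices, while `attach_hook_after_init(2)` reaches both, regardless of whether initialization had already happened.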
Yu, Guangye	8f9f12c068	Intel GPU Runtime Upstreaming for Device Allocator (#118091 ) # Motivation According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of the device `Allocator` dedicated to XPU to PyTorch. Following our design, we are also preparing to generalize `Allocator` in parallel. # Design In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below. <p align="center"> <img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218"> </p> # Additional Context We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`. Besides these PRs, we plan to make the device `Allocator` device-agnostic through another PR. In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in a subsequent PR that generalizes `Allocator`. The differences with CUDA: only key functionality, and lack of AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segment... Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD ghstack dependencies: #117611, #117619, #117734	2024-02-16 06:46:00 +00:00
Mu-Chu Lee	b8be8b639f	Add Runtime Constant-Folding function of AOTInductor for AOTInductorModels used internally. (#119823 ) Summary: 1. Make sure folded constants generated internally don't get exposed. 2. Add runConstantFolding and related API calls Test Plan: ```buck2 run mode/opt-split-dwarf -c fbcode.nvcc_arch=v100,a100 caffe2/caffe2/fb/predictor/tests_gpu:pytorch_predictor_container_gpu_test -- --gtest_filter=PyTorchPredictorContainerTest.LoadAOTInductorModel ``` The test triggers the added predictor tests `test_aot_inductor_merge_net_file_*.predictor_20240206`, which would trigger runConstantFolding from predictor's module loading. Reviewed By: SherlockNoMad Differential Revision: D53718139 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119823 Approved by: https://github.com/chenyang78	2024-02-16 06:45:48 +00:00
Yu, Guangye	4dc75f9084	Intel GPU Runtime Upstreaming for Event (#117734 ) # Motivation As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event`, which handles the status of an operation that is being executed. In some circumstances, `Event` gives us fine-grained control over operation execution. # Design `XPUEvent` is a movable but not copyable wrapper around a sycl event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, `XPUEvent` can wait for another `XPUEvent` or all the submitted kernels on an `XPUStream` to complete. In line with the other backends, the C++ files related to `Event` will be placed in the `aten/src/ATen/xpu` folder. For frontend code, the `XPUEvent` runtime API will be bound to Python `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and the Python code will be placed in `torch/xpu/streams.py`. # Additional Context It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`. We will be adding support for it soon. Meanwhile, `XPUEvent` doesn't support IPC from different processes. For the other parts, we have almost a 1:1 mapping with CUDA. The following APIs are missing: - `torch.cuda.Event.ipc_handle` - `CUDAEvent`'s constructor with `IpcEventHandle` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #117611, #117619	2024-02-16 06:28:26 +00:00
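The "movable but not copyable, created lazily on record" design can be sketched as below. This is only an illustration under stated assumptions: `EventSketch` and `NativeEventSketch` are hypothetical names, and a small struct stands in for the underlying sycl event, so none of this is the actual `XPUEvent` code.

```cpp
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the native event handle.
struct NativeEventSketch {
  int stream_id;
};

// Move-only wrapper whose native handle is created on the first record().
class EventSketch {
 public:
  EventSketch() = default;

  // Copying is disabled: two wrappers must never own the same handle.
  EventSketch(const EventSketch&) = delete;
  EventSketch& operator=(const EventSketch&) = delete;

  // Moving transfers ownership and leaves the source without a handle.
  EventSketch(EventSketch&&) noexcept = default;
  EventSketch& operator=(EventSketch&&) noexcept = default;

  // Lazily creates the native event, then records the given stream on it.
  void record(int stream_id) {
    if (!event_) {
      event_ = std::make_unique<NativeEventSketch>();
    }
    event_->stream_id = stream_id;
  }

  bool is_created() const { return event_ != nullptr; }

 private:
  std::unique_ptr<NativeEventSketch> event_;
};

static_assert(!std::is_copy_constructible<EventSketch>::value,
              "EventSketch must not be copyable");
static_assert(std::is_move_constructible<EventSketch>::value,
              "EventSketch must be movable");
```

Holding the handle in a `std::unique_ptr` gives the move-only semantics for free and makes "not yet created" representable as a null pointer.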
cyy	d4882e438a	[DeviceIndex][5/N] Use DeviceIndex in more places (#119866 ) This PR follows the series of patches beginning with #119142 and fixes various CUDA related methods to use DeviceIndex. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119866 Approved by: https://github.com/Skylion007	2024-02-15 07:01:43 +00:00
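One reason to prefer a dedicated index type over a plain `int` is that conversions become explicit and checkable. The sketch below assumes an 8-bit signed index type, which mirrors how `c10::DeviceIndex` is understood to be defined; `DeviceIndexSketch` and `to_device_index` are hypothetical names and not part of the PR.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Assumption: an 8-bit signed alias, standing in for c10::DeviceIndex.
using DeviceIndexSketch = std::int8_t;

// Hypothetical helper: converts an int to the narrow index type and
// rejects values that would not survive the narrowing.
DeviceIndexSketch to_device_index(int value) {
  if (value < std::numeric_limits<DeviceIndexSketch>::min() ||
      value > std::numeric_limits<DeviceIndexSketch>::max()) {
    throw std::out_of_range("device index does not fit in DeviceIndexSketch");
  }
  return static_cast<DeviceIndexSketch>(value);
}
```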
cyy	5f9b432494	[2/N] Replace std::tie with structural binding (#119879 ) This PR follows #119774; the Python-generated code was changed to use structured bindings. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119879 Approved by: https://github.com/albanD	2024-02-15 02:56:34 +00:00
Eddie Yan	cd380c794f	[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 ) #113713 Going to clean up some of the checks and will remove draft status after. Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`. CC @drisspg @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663 Approved by: https://github.com/drisspg	2024-02-14 22:02:06 +00:00
Joel Schlosser	9ec8dd2467	Reify view_func() closures as ViewFuncs (#118404 ) Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on. ```cpp /// Base class for view functions, providing reapplication of a view on a new base. /// Each view op should get a codegenerated subclass of this class containing /// any state needed to reconstruct the view. The class also provides convenience /// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification, /// where we want to use symbolic values or fake tensors instead. struct TORCH_API ViewFunc { virtual ~ViewFunc() {} /// Returns any SymInts in the saved state. virtual std::vector<c10::SymInt> get_symints() const { return {}; } /// Returns the number of SymInts in the saved state. virtual size_t num_symints() const { return 0; } /// Returns any tensors in the saved state. virtual std::vector<at::Tensor> get_tensors() const { return {}; } /// Returns the number of tensors in the saved state. virtual size_t num_tensors() const { return 0; } /// Reapplies the view on the given base using the saved state. virtual at::Tensor operator()(const at::Tensor&) const = 0; /// Returns a clone of this ViewFunc, optionally with the specified saved state. virtual std::unique_ptr<ViewFunc> clone_and_set( std::optional<std::vector<c10::SymInt>> = c10::nullopt, std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0; protected: /// Sets the values of any SymInts in the saved state. The input vector size must /// match the number of SymInts in the saved state (i.e. the size of the list /// returned by get_symints()). 
virtual void set_symints(std::vector<c10::SymInt>) {} /// Sets the values of any Tensors in the saved state. The input vector size must /// match the number of Tensors in the saved state (i.e. the size of the list /// returned by get_tensors()). virtual void set_tensors(std::vector<at::Tensor>) {} }; ``` New codegen files: * `torch/csrc/autograd/generated/ViewFunc.h` * `torch/csrc/autograd/generated/ViewFuncs.cpp` The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd. Example codegen for `slice.Tensor`: ```cpp // torch/csrc/autograd/generated/ViewFuncs.h #define SLICE_TENSOR_VIEW_FUNC_AVAILABLE struct SliceTensorViewFunc : public torch::autograd::ViewFunc { SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step) {}; virtual ~SliceTensorViewFunc() override {}; virtual std::vector<c10::SymInt> get_symints() const override; virtual size_t num_symints() const override; virtual std::vector<at::Tensor> get_tensors() const override; virtual size_t num_tensors() const override; virtual at::Tensor operator()(const at::Tensor&) const override; virtual std::unique_ptr<ViewFunc> clone_and_set( std::optional<std::vector<c10::SymInt>> = c10::nullopt, std::optional<std::vector<at::Tensor>> = c10::nullopt) const override; protected: virtual void set_symints(std::vector<c10::SymInt>) override; virtual void set_tensors(std::vector<at::Tensor>) override; private: int64_t dim; c10::optional<c10::SymInt> start; c10::optional<c10::SymInt> end; c10::SymInt step; }; ... // torch/csrc/autograd/generated/ViewFuncs.cpp std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const { ::std::vector<c10::SymInt> symints; symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 
1 : 0) + 1); if(start.has_value()) symints.insert(symints.end(), (start)); if(end.has_value()) symints.insert(symints.end(), (end)); symints.push_back(step); return symints; } size_t SliceTensorViewFunc::num_symints() const { return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1); } void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) { TORCH_INTERNAL_ASSERT(symints.size() == num_symints()); auto i = 0; if(start.has_value()) start = symints[i]; i += (start.has_value() ? 1 : 0); if(end.has_value()) end = symints[i]; i += (end.has_value() ? 1 : 0); step = symints[i]; } std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const { ::std::vector<at::Tensor> tensors; return tensors; } size_t SliceTensorViewFunc::num_tensors() const { return static_cast<size_t>(0); } void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) { TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors()); } at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const { return at::_ops::slice_Tensor::call(input_base, dim, start, end, step); } std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set( std::optional<std::vector<c10::SymInt>> symints, std::optional<std::vector<at::Tensor>> tensors) const { auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step); if (symints.has_value()) { output->set_symints(std::move((symints))); } if (tensors.has_value()) { output->set_tensors(std::move((tensors))); } return output; } ``` The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification. For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly. 
```sh python test/test_autograd.py -k test_view_func_replay python test/test_ops.py -k test_view_replay ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404 Approved by: https://github.com/ezyang	2024-02-14 22:00:43 +00:00
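The hot-swapping idea behind `clone_and_set()` and the visitor functions can be sketched in a reduced form. This is a simplified illustration, not the generated code: `std::int64_t` stands in for `c10::SymInt`, tensors are omitted, and `ViewFuncSketch`, `NarrowSketch`, and `swap_ints` are hypothetical names.

```cpp
#include <cstdint>
#include <memory>
#include <optional>
#include <utility>
#include <vector>

// Reduced base class: exposes the saved integer state and a clone operation
// that can optionally replace that state.
struct ViewFuncSketch {
  virtual ~ViewFuncSketch() = default;
  virtual std::vector<std::int64_t> get_ints() const = 0;
  virtual std::unique_ptr<ViewFuncSketch> clone_and_set(
      std::optional<std::vector<std::int64_t>> ints = std::nullopt) const = 0;
};

// Hypothetical subclass holding the state of a narrow-like view.
struct NarrowSketch : ViewFuncSketch {
  NarrowSketch(std::int64_t start, std::int64_t length)
      : start(start), length(length) {}

  std::vector<std::int64_t> get_ints() const override {
    return {start, length};
  }

  std::unique_ptr<ViewFuncSketch> clone_and_set(
      std::optional<std::vector<std::int64_t>> ints =
          std::nullopt) const override {
    auto out = std::make_unique<NarrowSketch>(start, length);
    if (ints.has_value()) {
      // at() throws if the caller passed the wrong number of values.
      out->start = ints->at(0);
      out->length = ints->at(1);
    }
    return out;
  }

  std::int64_t start;
  std::int64_t length;
};

// Applies a visitor to every saved integer and returns a clone carrying the
// visited values, loosely analogous to passing a symint_visitor_fn.
template <typename Visitor>
std::unique_ptr<ViewFuncSketch> swap_ints(const ViewFuncSketch& func,
                                          Visitor visit) {
  std::vector<std::int64_t> ints = func.get_ints();
  for (auto& value : ints) {
    value = visit(value);
  }
  return func.clone_and_set(std::move(ints));
}
```

The original object is left untouched; only the clone carries the swapped state, which is what makes the operation safe to use while the original view is still live.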
atalman	244b124bb8	Add linux cpu test for 3.12 (#117853 ) This is continuation of work: https://github.com/pytorch/pytorch/pull/113987 Co-authored-by: albanD <desmaison.alban@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117853 Approved by: https://github.com/albanD	2024-02-14 20:52:23 +00:00
cyy	87c6cd2f00	[1/N] Replace std::tie with structural binding (#119774 ) This PR replaces some std::tie calls with structured bindings from C++17. This not only makes the code more compact, but also yields some performance gain. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119774 Approved by: https://github.com/albanD, https://github.com/malfet	2024-02-14 09:25:04 +00:00
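The two styles can be compared side by side. This is an illustrative example rather than code from the PR; `min_max`, `range_with_tie`, and `range_with_binding` are hypothetical helpers.

```cpp
#include <algorithm>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical helper: returns the smallest and largest element.
// The input must be non-empty.
std::pair<int, int> min_max(const std::vector<int>& values) {
  auto [lo, hi] = std::minmax_element(values.begin(), values.end());
  return {*lo, *hi};
}

// Pre-C++17 style: declare the variables first, then assign via std::tie.
int range_with_tie(const std::vector<int>& values) {
  int lo = 0;
  int hi = 0;
  std::tie(lo, hi) = min_max(values);
  return hi - lo;
}

// C++17 style: a structured binding declares and initializes both names
// in one statement, with no default-constructed placeholders.
int range_with_binding(const std::vector<int>& values) {
  const auto [lo, hi] = min_max(values);
  return hi - lo;
}
```

With `std::tie`, each variable is first default-initialized and then assigned; the structured binding names the members of the returned object directly, which is one plausible source of the gain mentioned above.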
Shuqiang Zhang	a45c627f27	[c10d][flight recorder] store a copy of string in entry (#119837 ) Summary: Previously, we just stored the char pointer in the entry; the string is a temporary object and will already be destructed by the time we want to dump/access it. A quick fix is to store a copy of the string, without changing the upstream char*. An alternative is to change every profilingTitle into std::string; this, however, would need a comprehensive overhaul of the code up to the c10d::work layer above workNCCL, RecordFunction, etc. We chose the first option for this change. Resolve #119808 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119837 Approved by: https://github.com/zdevito, https://github.com/wconstab	2024-02-14 09:13:56 +00:00
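The chosen fix, keeping the upstream `char*` interface but copying the characters into the entry, can be sketched like this. The names `EntrySketch`, `make_title`, and `record` are hypothetical and only illustrate the ownership change.

```cpp
#include <string>

// Hypothetical entry type: it owns a copy of the title, so it stays valid
// after the string it was created from has been destroyed.
struct EntrySketch {
  std::string profiling_title;
};

// Hypothetical producer of a short-lived title string.
std::string make_title(int seq) {
  return "collective#" + std::to_string(seq);
}

// The upstream side still hands over a raw char pointer. Storing that pointer
// would dangle once `title` goes out of scope, so the entry copies the
// characters while the source string is still alive.
EntrySketch record(int seq) {
  std::string title = make_title(seq);
  const char* upstream_pointer = title.c_str();
  EntrySketch entry{std::string(upstream_pointer)};
  return entry;
}
```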
Jason Ansel	75a6d6aef7	[inductor] Support storage resizing (#119749 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119749 Approved by: https://github.com/yf225 ghstack dependencies: #119647, #119671	2024-02-14 03:03:38 +00:00
cyy	cb0886ecf2	[DeviceIndex][4/N] Use DeviceIndex in more places (#119741 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/119741 Approved by: https://github.com/aaronenyeshi, https://github.com/ezyang	2024-02-14 00:29:10 +00:00
Jason Ansel	cf117e37d5	Refactor THPStorage_resize_ (#119671 ) Moving code around to allow it to be reused in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119671 Approved by: https://github.com/yf225 ghstack dependencies: #119647	2024-02-13 23:28:47 +00:00
albanD	ca777fbbb7	Add Accelerator device and shell hooks (#119329 ) This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8 It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329 Approved by: https://github.com/ezyang, https://github.com/huydhn	2024-02-13 23:15:24 +00:00
Edward Z. Yang	6665b96ebb	Rewrite maybe_reduce more carefully for unbacked SymInt (#119562 ) Fixes https://github.com/pytorch/pytorch/issues/119476 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/119562 Approved by: https://github.com/albanD ghstack dependencies: #119559	2024-02-13 21:40:06 +00:00
Ke Wen	28f299a870	[c10d] Fix compilation of NCCL_EXP path (#119805 ) Fixes issue pointed out in https://github.com/pytorch/pytorch/pull/119421#issuecomment-1941694621 When refactoring ProcessGroupNCCL, some code in the NCCL_EXP path wasn't done cleanly. Cc: @kunalb @H-Huang Pull Request resolved: https://github.com/pytorch/pytorch/pull/119805 Approved by: https://github.com/H-Huang	2024-02-13 21:26:59 +00:00
Guilherme Leobas	3319dbcd23	Update vmap guard to avoid recompilations (#119061 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119061 Approved by: https://github.com/zou3519	2024-02-13 20:50:23 +00:00
Shuqiang Zhang	abadbbc4b0	[c10d][flight recorder] remove unintended assignment of entry (#119748 ) Summary: auto& entry = entries_.at(id % max_entries_); entry = entries_.at(id % max_entries_); The second line above has the unintended consequence of invoking copy assignment of entry objects, because a reference itself cannot be re-assigned. What could also cause the crash is that the entry reference becomes invalid if entries_ is resized by other threads, which could result in a 'copy to a garbage location'. The fix is to use a pointer, which can be re-assigned after re-acquiring the lock. Tests: python test/distributed/test_c10d_nccl.py NCCLTraceTest Pull Request resolved: https://github.com/pytorch/pytorch/pull/119748 Approved by: https://github.com/wconstab, https://github.com/fegin	2024-02-13 20:18:58 +00:00
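The difference between assigning through a reference and re-pointing a pointer can be shown in a few lines. This is a standalone illustration with a hypothetical `Entry` type, without the locking or resizing involved in the real code.

```cpp
#include <vector>

struct Entry {
  int id = 0;
};

// Assigning through a reference does not rebind it: the assignment copies
// the right-hand object INTO the object the reference already refers to.
int reference_assignment_copies() {
  std::vector<Entry> entries(2);
  entries[0].id = 10;
  entries[1].id = 20;
  Entry& entry = entries[0];
  entry = entries[1];    // overwrites entries[0] with a copy of entries[1]
  return entries[0].id;  // slot 0 now holds 20
}

// A pointer can be re-pointed without copying any Entry.
int pointer_can_be_rebound() {
  std::vector<Entry> entries(2);
  entries[0].id = 10;
  entries[1].id = 20;
  Entry* entry = &entries[0];
  entry = &entries[1];   // only the pointer changes
  return entries[0].id;  // slot 0 still holds 10
}
```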
cyy	47a2e6b6b8	Fix C++20 build (#112333 ) Currently the C++20 build fails because of incorrect template initialization order. This PR adjusts the order of these classes and a constructor to address the issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112333 Approved by: https://github.com/albanD	2024-02-13 05:10:19 +00:00
min-jean-cho	2502a01110	Linear-BN Fusion: add precondition check (#119264 ) Fixes #118990 The root cause is `out_features` of Linear not matching `num_features` of BatchNorm, resulting in a shape mismatch while computing `fused_w` and `fused_b`. This can happen for linear-bn folding because the linear layer operates over the last dim, `(*, H_in)`, while the bn layer operates over the channel dim, `(N, C_in, H, W)`. To preserve the shapes of the original linear weight and bias in linear-bn folding, check that linear `out_features` matches bn `num_features`. If they don't match, bn `num_features` needs to be 1 to broadcast. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119264 Approved by: https://github.com/eellison	2024-02-13 04:16:34 +00:00
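The precondition stated above reduces to a small predicate. The function below is a hypothetical sketch of that rule only; `can_fuse_linear_bn` is not the name used in PyTorch, and the actual check lives in Python.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the stated rule: folding preserves the shapes
// of the linear weight and bias when the feature counts match, or when the
// batch norm has a single feature that can broadcast.
bool can_fuse_linear_bn(std::int64_t linear_out_features,
                        std::int64_t bn_num_features) {
  return linear_out_features == bn_num_features || bn_num_features == 1;
}
```

For example, a linear layer with 8 output features can be folded with a batch norm over 8 features or over 1 feature, but not with one over 4 features.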
PyTorch MergeBot	214f06ae3a	Revert "Add Accelerator device and shell hooks (#119329 )" This reverts commit `4b9568a360`. Reverted https://github.com/pytorch/pytorch/pull/119329 on behalf of https://github.com/huydhn due to Breaks internal build and requires OSS file update to fix it ([comment](https://github.com/pytorch/pytorch/pull/119329#issuecomment-1940278598))	2024-02-13 02:23:45 +00:00
cyy	10f3abc6b8	[DeviceIndex][3/N] Use DeviceIndex in more places (#119635 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/119635 Approved by: https://github.com/ezyang	2024-02-12 21:31:27 +00:00
suo	82248f0b1c	[export] improve FakeTensor serialization (#119531 ) Recently we made it possible to serialize ExportedPrograms with fake parameters/buffers/etc. The serialization regime was kind of wacky; basically we serialized a stub and reassembled the FakeTensor using metadata that we had stashed elsewhere in the Graph state. This was bad for a few reasons: - Storing the metadata separately from the actual serialized object caused situations where you could have one but not the other. An example case is if you had a FakeTensor contained inside a TorchBind object: there was no obvious place to store the metadata for this. This actually happens; TensorQueue in fbgemm does this. - It created an annoying cycle: we had to deserialize the Graph's tensor metadata in order to deserialize (potentially faked) constants, but we need constants in order to deserialize the Graph. This fixes all that. The basic idea is to patch the reducer function for FakeTensor at serialization time, and serialize a copy of the FakeTensor metadata. We already are policing BC for the TensorMeta schema struct so it's not a net increase in the BC surface. As a bonus, I fixed a weird bug with torchbind tracing where we were accidentally reinterpreting a torch.ScriptObject as a torch.ScriptModule (which was the root cause of some weird behavior @bahuang was seeing last week). Differential Revision: [D53601251](https://our.internmc.facebook.com/intern/diff/D53601251/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119531 Approved by: https://github.com/zhxchen17	2024-02-12 19:28:08 +00:00
Edward Z. Yang	482345d747	Refactor out shape test into InputMetadata::maybe_reduce (#119559 ) I'm going to gut this function shortly, and having it all on InputMetadata is convenient for this purpose. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/119559 Approved by: https://github.com/soulitzer	2024-02-12 19:27:48 +00:00
Yifu Wang	27ffede878	[reland] Fix estimate_nccl_collective_runtime (#118986 ) `estimate_nccl_collective_runtime` has been broken and the errors have been silently swallowed by inductor. This PR: - Fixes the issues described in https://github.com/pytorch/pytorch/issues/118497. - Adds white-box testing so future issues can be surfaced in tests. - Adds support for native funcol IRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118986 Approved by: https://github.com/yf225 ghstack dependencies: #119102	2024-02-12 18:48:06 +00:00
Ke Wen	b2043c0543	[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421 ) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421 Approved by: https://github.com/shuqiangzhang	2024-02-12 18:45:49 +00:00
Shuqiang Zhang	893dcac068	[c10d] explicitly abort communicators in destroy_process_group call (#119250 ) Summary: This PR tries to resolve issue #119215. Process group shutdown (and hence ncclCommAbort) is now called in the destroy_process_group APIs for the corresponding PGs. The destructor of ProcessGroup no longer calls abort/ncclCommAbort; instead, it only checks whether the user has already explicitly called destroy_process_group. If not, the destructor logs a warning and encourages users to do so in order to clean up the PGs' resources. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250 Approved by: https://github.com/minsii, https://github.com/kwen2501	2024-02-12 18:40:28 +00:00
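The split between an explicit shutdown call and a destructor that only warns can be sketched as follows. This is a hedged illustration: `ProcessGroupSketch` and its members are hypothetical, and a boolean flag stands in for actually aborting communicators.

```cpp
#include <iostream>
#include <string>
#include <utility>

// Hypothetical stand-in for a process group that owns communicators.
class ProcessGroupSketch {
 public:
  explicit ProcessGroupSketch(std::string name) : name_(std::move(name)) {}

  // Explicit cleanup path, intended to be reached from a destroy call.
  // A real implementation would abort its communicators here.
  void shutdown() { shut_down_ = true; }

  bool is_shut_down() const { return shut_down_; }

  // The destructor performs no cleanup of its own. It only reports when the
  // explicit path was skipped.
  ~ProcessGroupSketch() {
    if (!shut_down_) {
      std::cerr << "warning: process group '" << name_
                << "' destroyed without an explicit shutdown\n";
    }
  }

 private:
  std::string name_;
  bool shut_down_ = false;
};
```

Keeping potentially blocking or failing work out of the destructor means teardown happens at a point the user controls, and forgetting it is surfaced as a warning rather than as a hang during destruction.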
Bin Bao	52a3de6cbf	[AOTI][refactor] Move ThreadLocalCachedOutputTensor into a separate header (#119392 ) Summary: Move common functionality into a separate header so that later JIT and AOT Inductor can share it. Test Plan: CI Differential Revision: D53523452 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119392 Approved by: https://github.com/khabinov	2024-02-12 15:56:16 +00:00
PyTorch MergeBot	24bdd03d23	Revert "Reify view_func() closures as ViewFuncs (#118404 )" This reverts commit `d5a6762263`. Reverted https://github.com/pytorch/pytorch/pull/118404 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/118404#issuecomment-1938600260))	2024-02-12 12:38:51 +00:00
PyTorch MergeBot	0342b227e5	Revert "[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421 )" This reverts commit `f3e7d80993`. Reverted https://github.com/pytorch/pytorch/pull/119421 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/119421#issuecomment-1938169747))	2024-02-12 07:34:20 +00:00
cyy	8a3c241094	Remove unused header inclusion (#119667 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/119667 Approved by: https://github.com/Skylion007	2024-02-12 05:36:25 +00:00
Mu-Chu Lee	dcb08a7044	Add CUDAEvent recording for constant folding to show up. (#119216 ) Summary: Add a layer of call to let CUDAEvent show up for constant folding. Test Plan: Existing tests Differential Revision: D53437934 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119216 Approved by: https://github.com/khabinov	2024-02-12 03:46:36 +00:00
cyy	568740f080	[DeviceIndex][2/N] Use DeviceIndex instead of int in allocators (#119545 ) Follows #119142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119545 Approved by: https://github.com/ezyang	2024-02-10 20:27:59 +00:00
Mu-Chu Lee	e71c202520	Use CUDA if cuda's macro is set for AOTI runner's pybind (#119616 ) Summary: Use CUDA if cuda's macro is set for AOTI runner's pybind. This is a duplicate of #119438, created because of landing issues. Test Plan: Existing tests (D52303882) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119616 Approved by: https://github.com/khabinov	2024-02-10 11:00:47 +00:00
Yu, Guangye	8fd11cb307	[2/2] Intel GPU Runtime Upstreaming for Stream (#117619 ) # Motivation According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`. # Design Currently, it primarily offers stream-related APIs, including - `torch.xpu.StreamContext` - `torch.xpu.current_stream` - `torch.xpu.set_stream` - `torch.xpu.synchronize` - `torch._C._xpu_getCurrentRawStream` # Additional Context We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR related to `Event`. The differences with CUDA: there is no default or external stream in XPU, and the following APIs are missing: - `torch.cuda.ExternalStream` - `torch.cuda.default_stream` - `torch.cuda.is_current_stream_capturing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #117611	2024-02-10 03:39:42 +00:00
Ke Wen	f3e7d80993	[c10d] PGNCCL refactor part 2: Simplify ProcessGroupNCCL into single-device style (#119421 ) Part 2 and last part of #118674: Introduce actual "single-device" code change to ProcessGroupNCCL. assert size == 1 and test refactor have been done in #119099. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119421 Approved by: https://github.com/shuqiangzhang	2024-02-09 20:23:20 +00:00
albanD	4b9568a360	Add Accelerator device and shell hooks (#119329 ) This adds a concept of Accelerator that points to one of our devices. See DeviceAccelerator.h in this PR for details https://github.com/pytorch/pytorch/pull/119329/files#diff-83cc748bed5df1a453c272cc5ecc7e572d4eb694c5125384d8fbd17a0b5f50c8 It also adds scaffolding for shared C++ API to allow generic feature implementation. This PR in particular updates the autograd engine to use this generic API. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119329 Approved by: https://github.com/ezyang	2024-02-09 18:54:28 +00:00
Joel Schlosser	d5a6762263	Reify view_func() closures as ViewFuncs (#118404 ) Replaces `view_func()` closures with a reified `ViewFunc` data structure. Codegen generates a `ViewFunc` subclass for each view op (e.g. `NarrowViewFunc`) containing state needed to reconstruct the view. The `ViewFunc` API allows for querying and hot-swapping any `SymInt`s or `Tensors` in the state through `get_symints()` / `get_tensors()` / `clone_and_set()`, which will be essential for fake-ification later on. ```cpp /// Base class for view functions, providing reapplication of a view on a new base. /// Each view op should get a codegenerated subclass of this class containing /// any state needed to reconstruct the view. The class also provides convenience /// accessors for saved SymInts / tensor state. This is useful for e.g. fake-ification, /// where we want to use symbolic values or fake tensors instead. struct TORCH_API ViewFunc { virtual ~ViewFunc() {} /// Returns any SymInts in the saved state. virtual std::vector<c10::SymInt> get_symints() const { return {}; } /// Returns the number of SymInts in the saved state. virtual size_t num_symints() const { return 0; } /// Returns any tensors in the saved state. virtual std::vector<at::Tensor> get_tensors() const { return {}; } /// Returns the number of tensors in the saved state. virtual size_t num_tensors() const { return 0; } /// Reapplies the view on the given base using the saved state. virtual at::Tensor operator()(const at::Tensor&) const = 0; /// Returns a clone of this ViewFunc, optionally with the specified saved state. virtual std::unique_ptr<ViewFunc> clone_and_set( std::optional<std::vector<c10::SymInt>> = c10::nullopt, std::optional<std::vector<at::Tensor>> = c10::nullopt) const = 0; protected: /// Sets the values of any SymInts in the saved state. The input vector size must /// match the number of SymInts in the saved state (i.e. the size of the list /// returned by get_symints()). 
virtual void set_symints(std::vector<c10::SymInt>) {} /// Sets the values of any Tensors in the saved state. The input vector size must /// match the number of Tensors in the saved state (i.e. the size of the list /// returned by get_tensors()). virtual void set_tensors(std::vector<at::Tensor>) {} }; ``` New codegen files: * `torch/csrc/autograd/generated/ViewFunc.h` * `torch/csrc/autograd/generated/ViewFuncs.cpp` The templates for these also contains impls for `ChainedViewFunc` and `ErroringViewFunc` which are used in a few places within autograd. Example codegen for `slice.Tensor`: ```cpp // torch/csrc/autograd/generated/ViewFuncs.h #define SLICE_TENSOR_VIEW_FUNC_AVAILABLE struct SliceTensorViewFunc : public torch::autograd::ViewFunc { SliceTensorViewFunc(int64_t dim, c10::optional<c10::SymInt> start, c10::optional<c10::SymInt> end, c10::SymInt step) : dim(dim), start(start), end(end), step(step) {}; virtual ~SliceTensorViewFunc() override {}; virtual std::vector<c10::SymInt> get_symints() const override; virtual size_t num_symints() const override; virtual std::vector<at::Tensor> get_tensors() const override; virtual size_t num_tensors() const override; virtual at::Tensor operator()(const at::Tensor&) const override; virtual std::unique_ptr<ViewFunc> clone_and_set( std::optional<std::vector<c10::SymInt>> = c10::nullopt, std::optional<std::vector<at::Tensor>> = c10::nullopt) const override; protected: virtual void set_symints(std::vector<c10::SymInt>) override; virtual void set_tensors(std::vector<at::Tensor>) override; private: int64_t dim; c10::optional<c10::SymInt> start; c10::optional<c10::SymInt> end; c10::SymInt step; }; ... // torch/csrc/autograd/generated/ViewFuncs.cpp std::vector<c10::SymInt> SliceTensorViewFunc::get_symints() const { ::std::vector<c10::SymInt> symints; symints.reserve((start.has_value() ? 1 : 0) + (end.has_value() ? 
1 : 0) + 1); if(start.has_value()) symints.insert(symints.end(), (start)); if(end.has_value()) symints.insert(symints.end(), (end)); symints.push_back(step); return symints; } size_t SliceTensorViewFunc::num_symints() const { return static_cast<size_t>((start.has_value() ? 1 : 0) + (end.has_value() ? 1 : 0) + 1); } void SliceTensorViewFunc::set_symints(std::vector<c10::SymInt> symints) { TORCH_INTERNAL_ASSERT(symints.size() == num_symints()); auto i = 0; if(start.has_value()) start = symints[i]; i += (start.has_value() ? 1 : 0); if(end.has_value()) end = symints[i]; i += (end.has_value() ? 1 : 0); step = symints[i]; } std::vector<at::Tensor> SliceTensorViewFunc::get_tensors() const { ::std::vector<at::Tensor> tensors; return tensors; } size_t SliceTensorViewFunc::num_tensors() const { return static_cast<size_t>(0); } void SliceTensorViewFunc::set_tensors(std::vector<at::Tensor> tensors) { TORCH_INTERNAL_ASSERT(tensors.size() == num_tensors()); } at::Tensor SliceTensorViewFunc::operator()(const at::Tensor& input_base) const { return at::_ops::slice_Tensor::call(input_base, dim, start, end, step); } std::unique_ptr<ViewFunc> SliceTensorViewFunc::clone_and_set( std::optional<std::vector<c10::SymInt>> symints, std::optional<std::vector<at::Tensor>> tensors) const { auto output = std::make_unique<SliceTensorViewFunc>(dim, start, end, step); if (symints.has_value()) { output->set_symints(std::move((symints))); } if (tensors.has_value()) { output->set_tensors(std::move((tensors))); } return output; } ``` The `_view_func()` / `_view_func_unsafe()` methods now accept two additional (optional) args for `symint_visitor_fn` / `tensor_visitor_fn`. If these are defined, they are expected to be python callables that operate on a single SymInt / tensor and return a new one. This allows for the hot-swapping needed during fake-ification. For testing, there are extensive pre-existing tests, and I added a test to ensure that hot-swapping functions correctly. 
```sh python test/test_autograd.py -k test_view_func_replay python test/test_ops.py -k test_view_replay ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118404 Approved by: https://github.com/ezyang	2024-02-09 18:51:36 +00:00
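The reified view-function idea in #118404 can be illustrated outside of C++. The following is a minimal Python sketch, not the generated code: `SliceViewFunc` is a hypothetical analogue of `SliceTensorViewFunc` that works on plain lists and plain integers rather than tensors and `SymInt`s. It shows how saved state can be queried and swapped through `get_symints()` / `clone_and_set()` without touching the original object.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class SliceViewFunc:
    """Hypothetical Python analogue of the generated SliceTensorViewFunc."""

    dim: int
    start: Optional[int]
    end: Optional[int]
    step: int

    def get_symints(self) -> List[int]:
        # Only values that are present count as saved state,
        # mirroring the has_value() checks in the generated C++.
        return [v for v in (self.start, self.end) if v is not None] + [self.step]

    def num_symints(self) -> int:
        return len(self.get_symints())

    def __call__(self, base: Sequence) -> Sequence:
        # Reapply the view on a new base. `dim` is ignored because
        # this sketch only handles one-dimensional lists.
        return base[slice(self.start, self.end, self.step)]

    def clone_and_set(self, symints: Optional[List[int]] = None) -> "SliceViewFunc":
        # Return a copy, optionally with the saved integers replaced.
        if symints is None:
            return replace(self)
        assert len(symints) == self.num_symints()
        it = iter(symints)
        start = next(it) if self.start is not None else None
        end = next(it) if self.end is not None else None
        return replace(self, start=start, end=end, step=next(it))


view = SliceViewFunc(dim=0, start=1, end=None, step=2)
swapped = view.clone_and_set([s + 1 for s in view.get_symints()])
print(view(list(range(10))))     # [1, 3, 5, 7, 9]
print(swapped(list(range(10))))  # [2, 5, 8]
```

Because a closure hides its captured values, this kind of swap is not possible with the old `view_func()` closures; making the state an explicit object is what enables it.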
Kurt Mohler	90dabff260	Avoid COW materialize in various operations (#119506 ) Operations affected include dot, cross, scatter/gather, shape, sort, triangular, unary, scalar, pad, complex, to_list, fft Pull Request resolved: https://github.com/pytorch/pytorch/pull/119506 Approved by: https://github.com/ezyang ghstack dependencies: #119501, #119502, #119503, #119504	2024-02-09 14:47:19 +00:00
cyy	560c92c324	[DeviceIndex] Use DeviceIndex instead of int in CUDA wrappers (#119142 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119142 Approved by: https://github.com/ezyang	2024-02-08 23:00:56 +00:00
Yang Chen	9f8ade04cc	[aot_inductor] replace TORCH_CHECK with AOTI_CHECK in the generate cpp code (#119220 ) When TORCH_CHECK appears inside loops, it may cause the host compiler to spend hours optimizing the run_impl function. This PR mitigates the issue by replacing TORCH_CHECK with a custom AOTI_CHECK, where we force the underlying assert function to be noinline. If forcing noinline causes a serious perf regression, we could either add an option to turn noinline on or off, or add another option to turn AOTI_CHECK into a no-op, similar to the `assert` macro from cassert. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119220 Approved by: https://github.com/hl475, https://github.com/desertfire	2024-02-08 21:57:27 +00:00
Qianli Scott Zhu	71e772f827	Update logging.cpp for explicit chrono import (#119469 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119469 Approved by: https://github.com/davidberard98	2024-02-08 21:57:23 +00:00
PyTorch MergeBot	7315ec7505	Revert "Fix estimate_nccl_collective_runtime (#118986 )" This reverts commit `0dab6fb352`. Reverted https://github.com/pytorch/pytorch/pull/118986 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/118986#issuecomment-1934680463))	2024-02-08 18:11:53 +00:00
Sheng Fu	2b9cba86cf	Fix deadlock in ExecutionTraceObserver (#119242 ) (#119398 ) Summary: With the compiled PyTorch module, in execution_trace_observer.cpp, function convertIValue calls TensorImpl->storage_offset(). That call triggers a recursive call into recordOperatorStart, which causes a deadlock on ob.g_mutex. This DIFF fixes the deadlock by replacing std::mutex with std::recursive_mutex. Since PyTorch only has one thread for FWD and one thread for BWD, contention is very low and performance should NOT be a concern. Test Plan: Unit Test buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 Differential Revision: D53533253 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119398 Approved by: https://github.com/aaronenyeshi	2024-02-08 18:00:51 +00:00
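The re-entrancy problem behind this fix can be reproduced in miniature. The sketch below is a Python analogy, not the observer code: `threading.Lock` stands in for `std::mutex`, `threading.RLock` for `std::recursive_mutex`, and the `Observer` class with its method names is hypothetical. It uses a non-blocking acquire so that the would-be deadlock is reported instead of hanging the process.

```python
import threading


class Observer:
    """Hypothetical stand-in for an observer guarded by a single mutex."""

    def __init__(self, lock):
        self.lock = lock
        self.events = []

    def record_operator_start(self, name, nested=False):
        # A blocking acquire on a plain Lock would hang here on re-entry;
        # the non-blocking form lets the sketch report it instead.
        if not self.lock.acquire(blocking=False):
            self.events.append(f"deadlock:{name}")
            return False
        try:
            self.events.append(f"start:{name}")
            if not nested:
                # Stands in for convertIValue -> storage_offset()
                # re-entering the observer on the same thread.
                return self.record_operator_start("storage_offset", nested=True)
            return True
        finally:
            self.lock.release()


plain = Observer(threading.Lock())
reentrant = Observer(threading.RLock())
print(plain.record_operator_start("op"), plain.events)
print(reentrant.record_operator_start("op"), reentrant.events)
```

With the plain lock the nested call cannot acquire the mutex its own thread already holds; with the re-entrant lock the same thread is allowed back in, which is the behavior the commit relies on.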
zdevito	7f05c72864	[nccl flight recorder] record time we discover start and complete (#119249 ) Some APIs like ncclCommAbort can cause nccl kernels to finish even if they were previously stuck. Because we can gather the trace buffer after those calls, we can end up seeing some collectives marked completed even though the completion happened several minutes after they started and clearly after the timeout. This changes how we record state so that we keep track of the time we discover a state change, so even if the collective eventually gets marked complete, we can observe that it happened minutes after it was scheduled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119249 Approved by: https://github.com/wconstab	2024-02-08 16:48:33 +00:00
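A rough sketch of the bookkeeping described in #119249, under the assumption that each trace entry simply stores the first time a state change was observed. The class, field, and function names below are illustrative and are not taken from the flight recorder source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CollectiveEntry:
    """Hypothetical trace entry; field names are illustrative only."""

    time_created: float
    state: str = "scheduled"
    time_discovered_started: Optional[float] = None
    time_discovered_completed: Optional[float] = None


def update_state(entry: CollectiveEntry, started: bool, completed: bool, now: float) -> None:
    # Record the time at which each transition was first *observed*,
    # which may be later than when it actually happened.
    if started and entry.time_discovered_started is None:
        entry.time_discovered_started = now
        entry.state = "started"
    if completed and entry.time_discovered_completed is None:
        entry.time_discovered_completed = now
        entry.state = "completed"


def completed_after_timeout(entry: CollectiveEntry, timeout_s: float) -> bool:
    # A "completed" entry can still be suspicious if completion was only
    # observed long after the collective was scheduled.
    done = entry.time_discovered_completed
    return done is not None and done - entry.time_created > timeout_s


entry = CollectiveEntry(time_created=0.0)
update_state(entry, started=True, completed=False, now=1.0)
update_state(entry, started=True, completed=True, now=900.0)
print(entry.state, completed_after_timeout(entry, timeout_s=600.0))
```

Keeping the discovery timestamps alongside the final state lets a reader of the dump tell a healthy collective from one that only finished after an abort.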
Peter Bell	08657b82f5	Reduce scope of dispatching in logcumsumexp_backward (#119397 ) Everything inside the `AT_DISPATCH` block is being compiled 5 times, so it makes sense to limit it to the only line that uses `scalar_t` which is the `numeric_limits` query. Also a small optimization, instead of computing `grad.log()` and `(-grad).log()` we can compute `grad.abs().log()` which is 2 pointwise ops instead of 3. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119397 Approved by: https://github.com/lezcano, https://github.com/albanD	2024-02-08 15:09:22 +00:00
Pritam Damania	f579c65ef6	Release GIL for torch::autograd::clear_autocast_cache (#119416 ) Fixes #119262 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119416 Approved by: https://github.com/albanD	2024-02-08 03:22:48 +00:00
Chien-Chin Huang	1d2382f141	[DDP] Use compiled_autograd to trace DDP backward allreduce (#110662 ) Summary The reducer of `DistributedDataParallel` is implemented with C++ and it is not easy to trace the allreduce launched in the reducer. This PR modifies `DistributedDataParallel` to launch one allreduce per gradient when `compiled_autograd` is enabled. The changes allow us to use `compiled_autograd` to trace the allreduce and later be optimized (fused) in the Inductor. Key Logic 1. If `ddp_python_hook` is True, we assume `compiled_autograd` is used. `DistributedDataParallel` registers `compiled_accum_grad_hook` for all parameters. 2. In the first forward() call, if `DistributedDataParallel` is not compiled, all `compiled_accum_grad_hook` are deregistered. If `DistributedDataParallel` is compiled, all `compiled_accum_grad_hook` will be compiled by `compiled_autograd`. 3. `compiled_accum_grad_hook` launches an allreduce to reduce the gradient of the parameter. Bucketing The compiled backward is slow because there is no bucketing for the allreduces. We rely on Inductor to bucket the allreduces. The bucketing is done in a separate PR. Differential Revision: [D49428482](https://our.internmc.facebook.com/intern/diff/D49428482/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110662 Approved by: https://github.com/wconstab	2024-02-08 03:03:15 +00:00
Yu, Guangye	9a992b0918	[4/4] Intel GPU Runtime Upstreaming for Device (#116869 ) # Motivation According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR covers the changes under lazy initialization. # Design This PR primarily offers the support of multi-processing via lazy initialization. We lazily initialize our runtime avoiding initializing XPU until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability. # Additional Context We adopt a similar design to CUDA. So we share some code with CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet ghstack dependencies: #119248	2024-02-08 03:01:21 +00:00
Ke Wen	029a16c41f	[c10d] PGNCCL refactor part 1: adds assert size==1 (#119099 ) Breaking #118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: https://github.com/pytorch/pytorch/pull/119099 Approved by: https://github.com/wconstab, https://github.com/XilunWu	2024-02-07 22:29:29 +00:00
Natalia Gimelshein	6fe5a3adaf	release GIL for cudaEventDestroy (#119393 ) cudaEventDestroy can become blocking under some circumstances, and then holding GIL will lead to deadlocks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119393 Approved by: https://github.com/Skylion007	2024-02-07 22:16:18 +00:00
albanD	a6e16fe202	Fix global in header warning (#119380 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119380 Approved by: https://github.com/janeyx99	2024-02-07 20:35:21 +00:00
Kaiming Ouyang	35aa353c48	Change watchdog log from "NCCL" to "Process group" (#118121 ) This PR changes the watchdog log. In order to avoid confusion that NCCL creates a watchdog thread and reports the error log, it is better to change "NCCL" to "Process group" to better indicate the source of the log. @wconstab Pull Request resolved: https://github.com/pytorch/pytorch/pull/118121 Approved by: https://github.com/kwen2501, https://github.com/wconstab	2024-02-07 20:14:49 +00:00
Hirochika Matsumoto	02c24b0b5e	Add Python binding `resizable` to class `{Untyped,Typed}Storage` (#119286 ) This PR exposes `resizable` method of `StorageImpl` to Python frontend to make it accessible for users. Fixes #119233 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119286 Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki	2024-02-07 19:15:55 +00:00
Yifu Wang	0dab6fb352	Fix estimate_nccl_collective_runtime (#118986 ) `estimate_nccl_collective_runtime` has been broken and the errors have been silently swallowed by inductor. This PR: - Fixes the issues described in https://github.com/pytorch/pytorch/issues/118497. - Adds white-box testing so future issues can be surfaced in tests. - Adds support for native funcol IRs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118986 Approved by: https://github.com/yf225 ghstack dependencies: #118910, #118911, #118437	2024-02-07 18:02:51 +00:00
Bin Bao	40ec155e58	[AOTI][refactor] Split common aoti_runtime utils into a separate header (#119066 ) Summary: Split common utils from aoti_runtime/model.h into a separate header file, because when turning on ABI-compatible mode for JIT Inductor we won't need AOTInductorModel, but we do need some common utils, e.g. RAIIAtenTensorHandle. Differential Revision: [D53478809](https://our.internmc.facebook.com/intern/diff/D53478809) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119066 Approved by: https://github.com/khabinov	2024-02-07 16:54:00 +00:00
Yu, Guangye	5c46600f84	[RELAND] refactor lazy init to device-agnostic (#119248 ) # Motivation This PR intends to extend `cuda_lazy_init` to `device_lazy_init`, which is a device-agnostic API that can support any backend, and to change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability. # Design We maintain a flag for each backend to manage the lazy initialization state separately. # Additional Context No additional UTs are needed. This is a reland PR; the original PR is [refactor lazy init to device-agnostic](https://github.com/pytorch/pytorch/pull/118846). This is a common PR, and does not trigger xpu ciflow. Differential Revision: [D53478332](https://our.internmc.facebook.com/intern/diff/D53478332) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119248 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/atalman	2024-02-07 15:58:51 +00:00
Simon Fan	1435cfecfa	Increase accumulate_grad_ gradient's expected refcount to account for pybind (#119068 ) Account for pybind of the op holding 1 ref when torch.ops.inductor.accumulate_grad_.default is called during run time Pull Request resolved: https://github.com/pytorch/pytorch/pull/119068 Approved by: https://github.com/jansel ghstack dependencies: #118817, #119334	2024-02-07 10:25:43 +00:00
Simon Fan	8e14e1d514	Fix gradient refcounts in pybind and compiled autograd (#118817 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118817 Approved by: https://github.com/jansel	2024-02-07 10:25:42 +00:00
PyTorch MergeBot	d85631b721	Revert "Fix deadlock in ExecutionTraceObserver (#119242 )" This reverts commit `6fc775ae13`. Reverted https://github.com/pytorch/pytorch/pull/119242 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/119242#issuecomment-1931445631))	2024-02-07 07:37:22 +00:00
William Wen	ee1c2449f7	[dynamo] delete dynamo cache entry when guard function is invalidated [attempt 2] (#119107 ) Attempt #2 for https://github.com/pytorch/pytorch/pull/117875 to fix https://github.com/pytorch/pytorch/issues/112090. Summary of changes: - ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor) - Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated. - ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor) - CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free. - code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references. - Added tests that check for memory leaks and cache deletion operations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119107 Approved by: https://github.com/jansel	2024-02-07 03:32:42 +00:00
Sheng Fu	6fc775ae13	Fix deadlock in ExecutionTraceObserver (#119242 ) Summary: With the compiled PyTorch module, in execution_trace_observer.cpp, function convertIValue calls TensorImpl->storage_offset(). That call triggers a recursive call into recordOperatorStart, which causes a deadlock on ob.g_mutex. This DIFF fixes the deadlock by replacing std::mutex with std::recursive_mutex. Since PyTorch only has one thread for FWD and one thread for BWD, contention is very low and performance should NOT be a concern. Test Plan: Unit Test buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 Differential Revision: D53299183 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119242 Approved by: https://github.com/aaronenyeshi	2024-02-06 23:36:22 +00:00
PyTorch MergeBot	9d46fe603d	Revert "[c10d] PGNCCL refactor part 1: adds assert size==1 (#119099 )" This reverts commit `4ab852b6c5`. Reverted https://github.com/pytorch/pytorch/pull/119099 on behalf of https://github.com/atalman due to Breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/119099#issuecomment-1930839754))	2024-02-06 22:14:36 +00:00
Chen_Liqing	422b4271ae	Change PrivateUse1's resize_bytes to PrivateUse1HooksInterface (#117839 ) Reopen from https://github.com/pytorch/pytorch/pull/117211 Modify the logic for entering the registration branch so that existing uts are not affected. Co-authored-by: albanD <desmaison.alban@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/117839 Approved by: https://github.com/albanD	2024-02-06 20:51:56 +00:00
William Wen	ae4e866bba	[dynamo] refactor CacheEntry and ExtraState to eval_frame.c to C++ (#118438 ) Part of implementing CacheEntry invalidation to fix https://github.com/pytorch/pytorch/issues/112090. Changes: - Move CacheEntry and ExtraState to C++ - Use pybind to control reference counting - Use std::list instead of manually implementing a linked list Pull Request resolved: https://github.com/pytorch/pytorch/pull/118438 Approved by: https://github.com/jansel	2024-02-06 20:48:11 +00:00
Edward Z. Yang	3f0fd36835	Introduce size oblivious guards (#118579 ) Fixes https://github.com/pytorch/pytorch/issues/117361 The implementation here slightly diverges from what was proposed in the issue, so I will recap what this PR is doing here. Today, when doing computations involving size-like unbacked SymInts, we assume for all operations that the compile time range of the integer is `[2, inf]`, even though at runtime we also accept zero and one. This PR removes the carte blanche assumption, and instead does the analysis in a much more limited and controlled fashion: only for guards which we have designated as "size oblivious" are we willing to do the analysis under the assumption that the range of all size-like unbacked SymInts is `[2, inf]`; otherwise, we will faithfully only do analysis with `[0, inf]` (or whatever the user provided) bounds. The infra pieces of this PR are: * Remove runtime_var_to_range from torch/fx/experimental/symbolic_shapes.py; modify `_constrain_range_for_size` to refine the range without clamping min to 2, and instead add the symbol to a `size_like` set in the ShapeEnv * When evaluating an expression, if the expression is requested to be evaluated in a `size_oblivious` way, we attempt to statically compute the value of the expression with the assumption that all symbols in `size_like` are updated to assume that they are `>= 2`. * Add Python and C++ APIs for guarding on a SymBool in a size-oblivious way. In C++, I also need to add some helpers for performing symbolic comparisons, since the stock comparisons immediately specialize in the "normal" way. The rest of the changes of the PR are marking various spots in PyTorch framework code as size oblivious, based on what our current test suite exercises. 
As you review the places where we have marked things as size oblivious, it may become clear why I ended up not opting for the "designate a branch as the default branch when it's not statically obvious which way to go": for some of the conditions, this answer is rather non-obvious. I think potentially there is another refinement on top of this PR, which is something like "I don't care if you can't figure it out with ValueRange analysis, go down this path anyway if there are unbacked sizes involved." But even if we add this API, I think we are obligated to attempt the ValueRange analysis first, since it can lead to better outcomes sometimes (e.g., we are able to figure out that something is contiguous no matter what the unbacked size is.) When is it permissible to mark something as size oblivious? Heuristically, it is OK anywhere in framework code if it gets you past a guard on unbacked SymInt problem. It is somewhat difficult to provide a true semantic answer, however. In particular, these annotations don't have any observational equivalence guarantee; for example, if I have `torch.empty(u0, 1).squeeze()`, we will always produce a `[u0]` size tensor, even though if `u0 == 1` PyTorch will actually produce a `[]` size tensor. The argument that I gave to Lezcano is that we are in fact defining an alternate semantics for a "special" size = 0, 1, for which we have these alternate eager mode semantics. In particular, suppose that we have a constant `special1` which semantically denotes 1, but triggers alternate handling rules. We would define `torch.empty(special1, 1).squeeze()` to always produce a `[special1]` size tensor, making its semantics coincide with unbacked SymInt semantics. In this model, the decision to designate guards as size oblivious is simply a user API question: you put them where ever you need some handling for special1! 
As we conservatively error out whenever it is not obvious what `special1` semantics should be, it is always valid to expand these semantics to cover more cases (although you can always choose the wrong semantics!) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118579 Approved by: https://github.com/eellison, https://github.com/lezcano	2024-02-06 19:45:32 +00:00
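The core of the size-oblivious idea in #118579 can be shown with a toy range analysis. The sketch below is a deliberate simplification and does not use the real `ShapeEnv` or sympy machinery: `Sym` and `statically_eq` are hypothetical names, and only equality against a constant is modeled. A result of `None` means "cannot decide statically", which in the real system would surface as a guard on an unbacked SymInt.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sym:
    """Toy unbacked symbol with an inclusive value range."""

    lo: int
    hi: float
    size_like: bool = False


def statically_eq(sym: Sym, const: int, size_oblivious: bool = False) -> Optional[bool]:
    # Only size-like symbols, and only size-oblivious guards, get the
    # optimistic assumption that the value is >= 2.
    lo = max(sym.lo, 2) if (size_oblivious and sym.size_like) else sym.lo
    if const < lo or const > sym.hi:
        return False  # const lies outside the range
    if lo == sym.hi == const:
        return True   # range collapses to exactly const
    return None       # undecidable from ranges alone


u0 = Sym(lo=0, hi=math.inf, size_like=True)
print(statically_eq(u0, 1))                       # None
print(statically_eq(u0, 1, size_oblivious=True))  # False
```

This matches the `torch.empty(u0, 1).squeeze()` discussion in the message: under the size-oblivious assumption `u0 == 1` is treated as false, so the first dimension is kept, even though at runtime `u0` may actually be 1.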
Ke Wen	4ab852b6c5	[c10d] PGNCCL refactor part 1: adds assert size==1 (#119099 ) Breaking #118674 into multiple smaller PRs. This is the first one. It adds `assert size==1` to PGNCCL, and refactors some old tests written in multi-device style (which would otherwise fail at the assert). Pull Request resolved: https://github.com/pytorch/pytorch/pull/119099 Approved by: https://github.com/wconstab	2024-02-06 06:59:47 +00:00
Yifu Wang	5086e1cf3f	Remove distributed/c10d/Functional.hpp (#119138 ) This file is useless and was accidentally checked in. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119138 Approved by: https://github.com/Skylion007	2024-02-05 21:58:08 +00:00
PyTorch MergeBot	ab613a4019	Revert "refactor lazy init to device-agnostic (#118846 )" This reverts commit `520771d7b3`. Reverted https://github.com/pytorch/pytorch/pull/118846 on behalf of https://github.com/atalman due to Failing, tests https://github.com/pytorch/torchdistx/blob/main/src/python/torchdistx/_C/fake.cc#L11 ([comment](https://github.com/pytorch/pytorch/pull/118846#issuecomment-1927651305))	2024-02-05 18:06:30 +00:00
Bin Bao	79b20aec76	[AOTI] Support copy_, _fft_c2c and view_as_real in C shim (#119125 ) Summary: These ops exist in GoogleFnet. Also add a Complex fallback for convert_element_type. After this PR, we can enable ABI-compatible for AOTInductor OSS CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119125 Approved by: https://github.com/chenyang78	2024-02-04 15:48:58 +00:00
Yifu Wang	372e9550bd	ProcessGroupGloo::reduce_scatter_tensor_coalesced (#118911 ) ### Motivation Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](`b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)`) and [this](`b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272)`)). For native funcol I ran into the same issues but I'd rather just fix the coverage. ### This PR We already have a fallback impl for `_reduce_scatter_base`, which is composed from all-reduce + scatter. The scatter was not necessary. It introduced extra communication and a sync point, and forced the impl to fail on `asyncOp=True`. This PR does the following: - Simulate reduce-scatter with `allreduce(inp).chunk(world_size)[rank]`. This is still 2x the communication of a real reduce-scatter (since all-reduce = reduce-scatter + all-gather), but it's strictly better than what we have now. - By doing the above, the comm becomes async and we don't have to fail on `asyncOp=True`. - The general logic is implemented in `reduce_scatter_tensor_coalesced`. `_reduce_scatter_base` just calls it with single input/output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118911 Approved by: https://github.com/shuqiangzhang ghstack dependencies: #118910	2024-02-03 02:42:47 +00:00
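The `allreduce(inp).chunk(world_size)[rank]` trick can be checked with plain lists. The sketch below simulates all ranks in one process; the helper names are hypothetical, the reduction is assumed to be a sum, and the input length is assumed to be divisible by the world size.

```python
from typing import List


def all_reduce_sum(inputs: List[List[float]]) -> List[float]:
    # Every rank ends up with the elementwise sum across ranks.
    return [sum(vals) for vals in zip(*inputs)]


def chunk(values: List[float], world_size: int) -> List[List[float]]:
    # Split into world_size equal pieces (length assumed divisible).
    size = len(values) // world_size
    return [values[i * size:(i + 1) * size] for i in range(world_size)]


def simulated_reduce_scatter(inputs: List[List[float]], rank: int) -> List[float]:
    # inputs[r] is the tensor contributed by rank r.
    world_size = len(inputs)
    return chunk(all_reduce_sum(inputs), world_size)[rank]


per_rank_inputs = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(simulated_reduce_scatter(per_rank_inputs, rank=0))  # [11, 22]
print(simulated_reduce_scatter(per_rank_inputs, rank=1))  # [33, 44]
```

Each rank computes the full reduced tensor and keeps only its own slice, which is why this costs more communication than a native reduce-scatter but needs no separate scatter step.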
lancerts	857508fa36	Change the internal assert to torch_check in torch::nn::functional::InterpolateFuncOptions (#117831 ) Fixes #117333 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117831 Approved by: https://github.com/malfet	2024-02-03 02:15:11 +00:00
briancoutinho	d91d21fd6f	[submodule kineto] Enable profiler connection to daemon during init for cpu only jobs (#118320 ) Fixes #112389 and https://github.com/facebookincubator/dynolog/issues/208 This PR enables profiler initialization for CPU-only use cases. The main goal is to enable on-demand profiling with a daemon when using the CPU-only mode of PyTorch. * When CUDA is available, the profiler is initialized on first CUDA stream creation (or lazily when the profiler is run). * Since the CUDA stream creation callback does not exist on CPU-only PyTorch, the profiler is never initialized on its own. * Thus the job does not register with Dynolog even when the "KINETO_USE_DAEMON" env variable is set. Part of the fix is in Kineto https://github.com/pytorch/kineto/pull/861, we point to it in PyTorch. The change in PyTorch is to correctly set the `cpuOnly` argument. ## TestPlan: Build PyTorch from source with USE_CUDA=0 so we have a CPU-only build. Git hash = `a40951defd87b9a5e582cf9112bf7a8bd0930c79` (See instructions in PyTorch repo) For the setup we run the dynolog daemon in another terminal ``` buck2 run dynolog/src:dynolog -- --enable_ipc_monitor & ``` Now run an example model in PyTorch - see [linear_model.py](https://github.com/facebookincubator/dynolog/blob/main/scripts/pytorch/linear_model_example.py) , and set the device to 'cpu' inside the code instead of 'cuda'. 
``` export KINETO_USE_DAEMON=1 python linear_model_example.py ``` Output shows the profiler registration with dynolog ``` (pytorch) [bcoutinho@devgpu038.ftw6 ~/local/pytorch (main)]$ python linear_model_example.py INFO:2024-01-25 11:08:53 1807792:1807792 init.cpp:122] Registering daemon config loader, cpuOnly = 1 INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1 INFO:2024-01-25 11:08:53 1807792:1807792 IpcFabricConfigClient.cpp:93] Setting up IPC Fabric at endpoint: dynoconfigclient0dc36b8a-e14c-4260-958b-4b2e7d15e986 status = initialized INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1 INFO:2024-01-25 11:08:53 1807792:1807792 DaemonConfigLoader.cpp:63] Setting communication fabric enabled = 1 ``` We can also collect a trace using ``` [bcoutinho@devgpu038.ftw6 ~/fbsource/fbcode (3bc85f968)]$ buck2 run dynolog/cli:dyno -- gputrace --log-file /tmp/test.json Kineto config = ACTIVITIES_LOG_FILE=/tmp/test.json PROFILE_START_TIME=0 ACTIVITIES_DURATION_MSECS=500 PROFILE_REPORT_INPUT_SHAPES=false PROFILE_PROFILE_MEMORY=false PROFILE_WITH_STACK=false PROFILE_WITH_FLOPS=false PROFILE_WITH_MODULES=false response length = 147 response = {"activityProfilersBusy":0,"activityProfilersTriggered":[1807792],"eventProfilersBusy":0,"eventProfilersTriggered":[],"processesMatched":[1807792]} Matched 1 processes Trace output files will be written to: /tmp/test_1807792.json ``` And trace file contains the trace correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118320 Approved by: https://github.com/aaronenyeshi	2024-02-03 01:40:56 +00:00
willfengg	63fd6883fd	[c10d] logging utility for cpp-python stacktrace (#118924 ) user may not know which line of code called collectives in a big code base. When debugging, we can print python-cpp stacktrace in case user call ``ProcessGroup.reduce`` instead of ``torch.distributed.reduce`` ``` LOG(INFO) << "ProcessGroupNCCL::_allgather_base stacktrace: " << get_python_cpp_trace(); ``` output (using _allgather_base as an example): one example python-part trace is ``all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838`` ``` ProcessGroupNCCL::_allgather_base stacktrace: #0 torch::unwind::unwind() from ??:0 #1 torch::CapturedTraceback::gather(bool, bool, bool) from ??:0 #2 c10d::get_python_cpp_trace[abi:cxx11]() from :0 #3 c10d::ProcessGroupNCCL::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from ??:0 #4 c10d::ops::(anonymous namespace)::_allgather_base_CUDA(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long) from Ops.cpp:0 #5 c10::impl::make_boxed_from_unboxed_functor<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > > ()(at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long), std::tuple<at::Tensor, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > >, c10::guts::typelist::typelist<at::Tensor&, at::Tensor&, c10::intrusive_ptr<c10d::ProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroup> > const&, bool, long> >, false>::call(c10::OperatorKernel, c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >) from :0 #6 
torch::autograd::basicAutogradNotImplementedFallbackImpl(c10::OperatorHandle const&, c10::DispatchKeySet, std::vector<c10::IValue, std::allocator<c10::IValue> >) from autograd_not_implemented_fallback.cpp:0 #7 c10d::ProcessGroup::_allgather_base(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&) from :0 #8 pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::)(at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(pybind11::cpp_function::initialize<c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> >, c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&, pybind11::name, pybind11::is_method, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v, pybind11::call_guard<pybind11::gil_scoped_release> >(c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > (c10d::ProcessGroup::)(at::Tensor&, 
at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&)#1}&&, c10::intrusive_ptr<c10d::Work, c10::detail::intrusive_target_default_null_type<c10d::Work> > ()(c10d::ProcessGroup, at::Tensor&, at::Tensor&, c10d::AllgatherOptions const&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&, pybind11::call_guard<pybind11::gil_scoped_release> const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from :0 #9 pybind11::cpp_function::dispatcher(_object, _object, _object) from :0 #10 cfunction_call from /usr/local/src/conda/python-3.10.12/Objects/methodobject.c:543 #11 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215 #12 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112 #13 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114 #14 all_gather_into_tensor from /data/users/weif/pytorch/torch/distributed/distributed_c10d.py:2838 #15 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #16 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945 #17 wrapper from /data/users/weif/pytorch/torch/distributed/c10d_logger.py:75 #18 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #19 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114 #20 _all_gather_flat_param from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1399 #21 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #22 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114 #23 unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_flat_param.py:1308 #24 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #25 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114 #26 _unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:332 #27 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #28 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114 #29 _pre_forward_unshard from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:448 #30 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #31 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114 #32 _pre_forward from /data/users/weif/pytorch/torch/distributed/fsdp/_runtime_utils.py:413 #33 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #34 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114 #35 forward from /data/users/weif/pytorch/torch/distributed/fsdp/fully_sharded_data_parallel.py:839 #36 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #37 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945 #38 _call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1520 #39 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #40 do_call_core from /usr/local/src/conda/python-3.10.12/Python/ceval.c:5945 #41 _wrapped_call_impl from /data/users/weif/pytorch/torch/nn/modules/module.py:1511 #42 _PyEval_EvalFrame from 
/usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #43 _PyObject_Call_Prepend from /usr/local/src/conda/python-3.10.12/Objects/call.c:431 #44 slot_tp_call from /usr/local/src/conda/python-3.10.12/Objects/typeobject.c:7494 #45 _PyObject_MakeTpCall from /usr/local/src/conda/python-3.10.12/Objects/call.c:215 #46 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:112 #47 inner from /data/users/weif/pytorch/run_fsdp.py:72 #48 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #49 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114 #50 run from /data/users/weif/pytorch/run_fsdp.py:76 #51 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #52 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114 #53 main from /data/users/weif/pytorch/run_fsdp.py:133 #54 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #55 _PyObject_VectorcallTstate from /usr/local/src/conda/python-3.10.12/Include/cpython/abstract.h:114 #56 <module> from /data/users/weif/pytorch/run_fsdp.py:137 #57 _PyEval_EvalFrame from /usr/local/src/conda/python-3.10.12/Include/internal/pycore_ceval.h:46 #58 PyEval_EvalCode from /usr/local/src/conda/python-3.10.12/Python/ceval.c:1134 #59 run_eval_code_obj from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1291 #60 run_mod from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1312 #61 pyrun_file from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:1208 #62 _PyRun_SimpleFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:456 #63 _PyRun_AnyFileObject from /usr/local/src/conda/python-3.10.12/Python/pythonrun.c:90 #64 pymain_run_file_obj from /usr/local/src/conda/python-3.10.12/Modules/main.c:357 #65 Py_BytesMain from 
/usr/local/src/conda/python-3.10.12/Modules/main.c:1090 #66 __libc_start_call_main from ??:0 #67 <unwind unsupported> from ??:0 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118924 Approved by: https://github.com/kwen2501	2024-02-02 23:49:18 +00:00
titaiwangms	a3cec6a7fa	[ONNX] Eliminate redundant TODOs (#119060 ) Remove titaiwangms/AllenTiTaiWang/titaiwang created TODOs: 1. Resolved TODOs 2. Turned TODOs to NOTEs if they are not actionable 3. Merge duplicated TODOs Pull Request resolved: https://github.com/pytorch/pytorch/pull/119060 Approved by: https://github.com/kit1980, https://github.com/thiagocrepaldi	2024-02-02 23:37:52 +00:00
Yifu Wang	fd000340fd	ProcessGroupGloo::allgather_into_tensor_coalesced (#118910 ) ### Motivation Despite our plan to reduce gloo usage, it is still widely used as a testing tool (in both the PyTorch CI and user tests) for code that only uses nccl in real-world scenarios. There are some coverage issues around all-gather and reduce-scatter variants, which are currently worked around in ugly ways (e.g. [this](`b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L216-L219)`) and [this](`b9e86bc93d/torch/distributed/_functional_collectives_impl.py (L262-L272)`)). For native funcol I ran into the same issues but I'd rather just fix the coverage. I think it's reasonable to think of this as a fix rather than adding new features. This is orthogonal to the potential reduction of gloo usage. ### This PR This PR adds `ProcessGroupGloo::allgather_into_tensor_coalesced`. This is very straightforward - `ProcessGroupGloo` already supports `allgather_coalesced`, to which we can funnel `allgather_into_tensor_coalesced`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118910 Approved by: https://github.com/shuqiangzhang	2024-02-02 17:53:28 +00:00
Yu, Guangye	520771d7b3	refactor lazy init to device-agnostic (#118846 ) # Motivation This PR extends `cuda_lazy_init` to `device_lazy_init`, a device-agnostic API that can support any backend. It also changes `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for CUDA while maintaining scalability. # Design We maintain a flag for each backend to manage the lazy initialization state separately. # Additional Context No additional unit tests are needed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118846 Approved by: https://github.com/malfet	2024-02-02 12:10:39 +00:00
albanD	54668ad6dc	Cleanup max cuda device (#118779 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118779 Approved by: https://github.com/ezyang	2024-02-01 21:11:28 +00:00
mantaionut	b0e65dd1b4	Fix TCP Store Windows (#118860 ) https://github.com/pytorch/pytorch/pull/107607 added a new Validate flow, but on Windows it did not call addMiscellaneousSocket. This PR adds the missing call to addMiscellaneousSocket on Windows. Fixes #118737 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118860 Approved by: https://github.com/awgu, https://github.com/malfet	2024-02-01 18:46:18 +00:00
garfield1997	ff9ce94489	Create empty host tensor for privateuseone (#118854 ) For the H2D copy of local_used_map_ on the privateuseone device, reuse the CUDA logic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118854 Approved by: https://github.com/ezyang	2024-02-01 15:32:55 +00:00
Yu, Guangye	a205e7bf56	[3/4] Intel GPU Runtime Upstreaming for Device (#116850 ) # Motivation Following [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), and as described in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`. # Design This PR primarily offers device-related APIs in the Python frontend, including - `torch.xpu.is_available` - `torch.xpu.device_count` - `torch.xpu.current_device` - `torch.xpu.set_device` - `torch.xpu.device` - `torch.xpu.device_of` - `torch.xpu.get_device_name` - `torch.xpu.get_device_capability` - `torch.xpu.get_device_properties` - ==================== - `torch.xpu._DeviceGuard` - `torch.xpu._is_compiled` - `torch.xpu._get_device` # Additional Context We will implement the support of lazy initialization in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet	2024-02-01 12:31:26 +00:00
Michael Suo	eaa45f47f8	[sigmoid] fix for torchbind serialization (#118791 ) Summary: There is an annoying inconsistency in how we pickle custom objs. `torch.save` will invoke regular pickle, for which we have bound `__setstate__`/`__getstate__` methods on `torch.ScriptObject`: https://fburl.com/code/4howyl4u. This serializes in a different format than TorchScript does, which uses the TS C++ pickler. The issue we were facing was using the Python pickler to save, and the C++ pickler to load. If we use the C++ pickler to both save and load (plus some plumbing to get type/object resolution to work correctly), then things should work. Test Plan: ran SherlockNoMad's repro ``` buck2 run 'fbcode//mode/dev-nosan' scripts/bahuang:export_torchbind -- --logging DBG ``` Got to a new error, which has to do with how we're initializing the graph, but will leave that for future diffs. Reviewed By: SherlockNoMad Differential Revision: D53248454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118791 Approved by: https://github.com/qxy11, https://github.com/SherlockNoMad, https://github.com/khabinov	2024-02-01 10:09:07 +00:00
Mu-Chu Lee	2b48891e62	[AOTInductor] Add Runtime Constant-folding for AOTInductor (#118765 ) Summary: Add Runtime Constant-folding for AOTInductor. This also includes the invocation of constant folding at load time. The constant folding lowering is a 2-step process. First, we split the graph into 2 modules. One of them is the constant module, which doesn't depend on any input, so the whole module can be inferred (constant-folded) once and reused. The constant module is lowered, codegen-ed as usual, and cached (let's call this the constant code). The constant code reuses the whole lowering/profiling/etc. process; the only difference is that we do not generate any headers or initialization for the constant code. Second, after handling the constant module, we take care of the main module (which is the part that depends on the user input). Compared with a normal lowering, the main module takes in one additional component, the constant code. The additional step here is that we inject the constant code into the codegen-ed main module and create the caller for the main module to consume the result of the constant module. Test Plan: Unit tests included in commit. Differential Revision: D53274382 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118765 Approved by: https://github.com/chenyang78	2024-02-01 04:54:25 +00:00
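The two-step split described in the message above can be illustrated with a toy graph. This is only a sketch in plain Python, not AOTInductor code: `GRAPH`, `CONSTANTS`, `split_graph`, `run`, and `main_module` are hypothetical names invented for the illustration, and the "graph" is just a list of `(name, function, argument names)` tuples.

```python
import operator

# Hypothetical toy graph: each node is (name, function, argument names).
GRAPH = [
    ("w_scaled", operator.mul, ("w", "scale")),   # depends only on constants
    ("bias2", operator.add, ("w_scaled", "b")),   # depends only on constants
    ("y", operator.mul, ("x", "w_scaled")),       # depends on the user input x
    ("out", operator.add, ("y", "bias2")),        # depends on the user input x
]
CONSTANTS = {"w": 3, "scale": 2, "b": 1}


def split_graph(graph, constants):
    """Step 1: separate nodes computable from constants alone from the rest."""
    known = set(constants)
    const_nodes, main_nodes = [], []
    for name, fn, args in graph:
        if all(arg in known for arg in args):
            const_nodes.append((name, fn, args))
            known.add(name)
        else:
            main_nodes.append((name, fn, args))
    return const_nodes, main_nodes


def run(nodes, env):
    """Evaluate nodes in order, returning the extended environment."""
    env = dict(env)
    for name, fn, args in nodes:
        env[name] = fn(*(env[arg] for arg in args))
    return env


const_nodes, main_nodes = split_graph(GRAPH, CONSTANTS)
folded = run(const_nodes, CONSTANTS)  # evaluated once, then reused


def main_module(x):
    """Step 2: the main part consumes the folded constants plus the input."""
    return run(main_nodes, {**folded, "x": x})["out"]
```

In this sketch `folded` plays the role of the cached result of the constant module, and `main_module` is the caller that consumes it on every invocation.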
Andrew Calvano	649f2e3000	Fix for out of bounds registers_ access in mobile TorchScript interpreter (#110300 ) Summary: The TorchScript interpreter had multiple opcodes whose logic had the potential to access the registers_ array out of bounds. This change ensures that all registers_ accesses are in bounds or an exception will be thrown. Test Plan: contbuild + OSS signals Differential Revision: D49748737 Pull Request resolved: https://github.com/pytorch/pytorch/pull/110300 Approved by: https://github.com/malfet, https://github.com/kimishpatel	2024-01-31 19:40:02 +00:00
Shan19900305	99b69e1ffb	add PrivateUse1 device support in function options_from_string. (#118627 ) add PrivateUse1 device support in function options_from_string. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118627 Approved by: https://github.com/soulitzer	2024-01-31 18:52:58 +00:00
Bin Bao	1128cf96f0	[AOTI] Support _embedding_bag in C shim (#118706 ) Summary: At some point I will stop manually adding ops to C shim, but use torchgen to generate those code. For the near term, I need to add a few more in order to switch the AOTInductor dashboard run. Differential Revision: [D53249074](https://our.internmc.facebook.com/intern/diff/D53249074) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118706 Approved by: https://github.com/frank-wei, https://github.com/aakhundov ghstack dependencies: #118704, #118705	2024-01-31 15:02:40 +00:00
Bin Bao	8db8ff652c	[AOTI] Add aoti_torch_view_dtype in C shim (#118705 ) Summary: Support ir.ComplexView in the ABI-compatible codegen Differential Revision: [D53249039](https://our.internmc.facebook.com/intern/diff/D53249039) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118705 Approved by: https://github.com/frank-wei ghstack dependencies: #118704	2024-01-31 14:42:29 +00:00
cyy	4a019047ad	Enable nested namespace check in clang-tidy (#118506 ) It is time to enable nested namespaces in the code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118506 Approved by: https://github.com/albanD	2024-01-31 00:32:35 +00:00
Shuqiang Zhang	e180218949	[c10d] Log the last enqueued and completed collective (#118582 ) Summary: While debugging some timed-out jobs, I found it difficult to identify which rank is at fault even though we have logs of many ranks reporting a timeout on a specific collective seq. If we also report lastEqueuedSeq and lastCompletedSeq, it becomes much easier to identify 1. whether a rank has not even joined a collective call (not enqueued), or 2. whether it has joined the collective call but not completed it. The 1st case is most likely a problem in the user's code; the 2nd case could be a lower-layer issue. Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/118582 Approved by: https://github.com/wconstab	2024-01-30 20:13:55 +00:00
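The two cases in the message above amount to a simple comparison per rank. The sketch below is an assumption about how one might read such logs, not code from c10d: `classify_rank` and its parameter names are hypothetical, and the per-rank values are made up for illustration.

```python
def classify_rank(last_enqueued_seq, last_completed_seq, timed_out_seq):
    """Classify one rank relative to the collective seq that timed out."""
    if last_enqueued_seq < timed_out_seq:
        # Case 1: the rank never joined the collective (likely user code).
        return "never enqueued"
    if last_completed_seq < timed_out_seq:
        # Case 2: the rank joined but did not finish (possibly a lower layer).
        return "enqueued, not completed"
    return "completed"


# Made-up (last enqueued, last completed) values per rank, for illustration.
reports = {0: (7, 6), 1: (6, 6), 2: (7, 7)}
diagnosis = {
    rank: classify_rank(enqueued, completed, timed_out_seq=7)
    for rank, (enqueued, completed) in reports.items()
}
```

With these sample values, rank 1 is the one that never reached collective 7, which is the kind of answer the extra logging is meant to make easy to find.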
garfield1997	fbf92500fb	enable privateuseone to perform streaming backward (#117111 ) Fixes #116957 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117111 Approved by: https://github.com/soulitzer	2024-01-30 15:13:31 +00:00
Yifu Wang	64efec9953	Port FakeProcessGroup to cpp (#118426 ) ### Summary Native functional collective ops require the backend to be implemented in C++. This PR ports `FakeProcessGroup` to cpp so that it can also work for native functional collective ops. The existing tests involving `FakeProcessGroup` all pass. In addition, `DeviceMeshTest::test_fake_pg_device_mesh` now passes with `_USE_NATIVE_C10D_FUNCTIONAL=1`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118426 Approved by: https://github.com/wanchaol ghstack dependencies: #113057	2024-01-30 11:40:13 +00:00
Shuqiang Zhang	c7af626a26	[c10d] allow nonblocking wrap of ncclCommInitRankConfig (#118256 ) resolve #117749 Summary: Updated the PR with the following intentions: 1. Identify eagerMode init (as opposed to lazy init), in which case we create NCCL comms without guarantees that they are fully initialized if NONBLOCKING mode is also enabled. 2. Python users can do other work (e.g., model init) between invoking init_process_group and their first collective call. 3. c10d guarantees (waits) that communicators are initialized before issuing the first collective call. 4. For NCCL collective calls, the contract between Python users and c10d is not changed much from blocking calls (c10d waits for the NCCL call to return ncclSuccess or to time out, whichever happens first). Pull Request resolved: https://github.com/pytorch/pytorch/pull/118256 Approved by: https://github.com/kwen2501	2024-01-30 06:23:20 +00:00
Yifu Wang	b778f44e97	Allow using native c10d_functional via _functional_collectives (#113057 ) This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification. NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057 Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol	2024-01-30 02:34:25 +00:00
PyTorch MergeBot	fb11354594	Revert "[c10d] relax the nccl error check for nonblocking mode (#118254 )" This reverts commit `993e4f3911`. Reverted https://github.com/pytorch/pytorch/pull/118254 on behalf of https://github.com/clee2000 due to has internal failures D53170606 ([comment](https://github.com/pytorch/pytorch/pull/118254#issuecomment-1915267786))	2024-01-29 17:56:40 +00:00
Wenyin Fu	65f8276bc6	add an option to specify custom addr2line binary (#118328 ) Users need to be able to pick their own addr2line binary in their deployment, for reasons such as the default addr2line being too slow. This option lets users quickly experiment with alternatives. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118328 Approved by: https://github.com/zdevito, https://github.com/aaronenyeshi	2024-01-29 16:36:38 +00:00
Will Constable	5f59d0c748	[C10D] Disarm PGNCCL Heartbeat Monitor to gather data (#118344 ) Summary: Leave monitoring thread 'running' in log-only mode. Use the kill logs to correlate with actual job outcomes (e.g. does stuck job detector agree?) Later, re-enable (using a justknobs knob this time) Test Plan: CI Differential Revision: D53108142 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118344 Approved by: https://github.com/shuqiangzhang, https://github.com/yifuwang, https://github.com/malfet, https://github.com/kwen2501	2024-01-29 06:09:36 +00:00
eqy	8d790abab9	[NCCL][c10d] Log failing pointer if deregistration fails (#118455 ) For debugging convenience CC @minsii @Aidyn-A @syed-ahmed @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/118455 Approved by: https://github.com/wconstab	2024-01-27 11:03:02 +00:00
PyTorch MergeBot	dabb90f2a4	Revert "[Exception] [6/N] Remove use of torch::TypeError (#117964 )" This reverts commit `87335fabae`. Reverted https://github.com/pytorch/pytorch/pull/117964 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/117964#issuecomment-1913079096))	2024-01-27 08:44:34 +00:00
Shuqiang Zhang	993e4f3911	[c10d] relax the nccl error check for nonblocking mode (#118254 ) resolve https://github.com/pytorch/pytorch/issues/117749 Summary: This is the first step to enable NCCL nonblocking mode. In NCCL nonblocking mode, ncclInProgress is an expected return value when checking communicators. Without this relaxation, watchdog thread would throw NCCL errors during work checking while it is expected. Test Plan: Set nonblocking mode in unit tests, and make sure all existing NCCL tests pass Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/118254 Approved by: https://github.com/kwen2501	2024-01-27 03:49:00 +00:00
David Berard	40c08795b0	[JIT] python IR bindings: consolidate tests, add short docs in OVERVIEW.md (#118319 ) Document the existence of python IR bindings; quick comments about it; and consolidate tests in one file to serve as examples to users. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118319 Approved by: https://github.com/eellison	2024-01-27 03:11:51 +00:00
Edward Z. Yang	9bce208dfb	Replace follow_imports = silent with normal (#118414 ) This is a lot of files changed! Don't panic! Here's how it works: * Previously, we set `follow_imports = silent` for our mypy.ini configuration. Per https://mypy.readthedocs.io/en/stable/running_mypy.html#follow-imports, what this does is whenever we have an import to a module which is not listed as a file to be typechecked in mypy, we typecheck it as normal but suppress all errors that occurred in that file. * When mypy is run inside lintrunner, the list of files is precisely the files covered by the glob in lintrunner.toml, but with files in excludes excluded. * The top-level directive `# mypy: ignore-errors` instructs mypy to typecheck the file as normal, but ignore all errors. * Therefore, it should be equivalent to set `follow_imports = normal`, if we put `# mypy: ignore-errors` on all files that were previously excluded from the file list. * Having done this, we can remove the exclude list from .lintrunner.toml, since excluding a file from typechecking is baked into the files themselves. * torch/_dynamo and torch/_inductor were previously in the exclude list, because they were covered by MYPYINDUCTOR. It is not OK to mark these as `# mypy: ignore-errors` as this will impede typechecking on the alternate configuration. So they are temporarily being checked twice, but I am suppressing the errors in these files as the configurations are not quite the same. I plan to unify the configurations so this is only a temporary state. * There were some straggler type errors after these changes somehow, so I fixed them as needed. There weren't that many. In the future, to start type checking a file, just remove the ignore-errors directive from the top of the file. The codemod was done with this script authored by GPT-4: ``` import glob exclude_patterns = [ ... 
] for pattern in exclude_patterns: for filepath in glob.glob(pattern, recursive=True): if filepath.endswith('.py'): with open(filepath, 'r+') as f: content = f.read() f.seek(0, 0) f.write('# mypy: ignore-errors\n\n' + content) ``` Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118414 Approved by: https://github.com/thiagocrepaldi, https://github.com/albanD	2024-01-27 02:44:11 +00:00
lancerts	af1338bfbf	fix escape nested comments in C++ (#117882 ) Fixes #115243, as it is tricky to deal with the nested comment in doxygen + sphinx. Change 6 below is adopted as the fix. All other changes do not work. After adopting change 6, realize the original `torch::optim::SGD sgd(0.9);` is not the correct call to the sgd constructor, modified to the correct one `torch::optim::SGD sgd(model->parameters(), 0.9);` - Original in [link](https://pytorch.org/cppdocs/api/function_namespacetorch_1ad98de93d4a74dd9a91161f64758f1a76.html#exhale-function-namespacetorch-1ad98de93d4a74dd9a91161f64758f1a76): `/// torch::optim::SGD sgd(/lr=/0.9);` ![image](https://github.com/pytorch/pytorch/assets/7495155/0054b355-4925-4112-93b4-9385fdc34bb9) - Change 1, this solution is referenced from [here](https://stackoverflow.com/questions/24978463/doxygen-escape-nested-comments-in-c): `/// torch::optim::SGD sgd(/&zwj;* lr= &zwj;/0.9);` ![image](https://github.com/pytorch/pytorch/assets/7495155/77ff2d18-3097-4265-8dcd-31d78acb9c6e) - Change 2: `/// torch::optim::SGD sgd(/ lr= // 0.9);` ![image](https://github.com/pytorch/pytorch/assets/7495155/b520f8de-ead7-4009-b0fb-f4517daba077) - Change 3: `/// torch::optim::SGD sgd(/\lr=\/0.9);` ![image](https://github.com/pytorch/pytorch/assets/7495155/07e9e608-4640-43c0-994a-37983b803003) - Change 4: `/// torch::optim::SGD sgd(/&lowast; lr= &lowast;/0.9);` ![image](https://github.com/pytorch/pytorch/assets/7495155/121e55c5-0802-4ff3-bbd7-3521e1299d94) - Change 5: ``` /// \rst /// .. 
code-block:: cpp /// /// torch::nn::Linear model(3, 4); /// torch::load(model, "model.pt"); /// \verbatim /// torch::optim::SGD sgd(/lr=/0.9); /// \endverbatim /// std::istringstream stream("..."); /// torch::load(sgd, stream); /// /// auto tensor = torch::ones({3, 4}); /// torch::load(tensor, "my_tensor.pt"); /// \endrst ``` ![image](https://github.com/pytorch/pytorch/assets/7495155/e675f551-e939-4be8-b24a-e2e53377dd08) - Change 6: `/// torch::optim::SGD sgd(0.9); // 0.9 is the learning rate` ![image](https://github.com/pytorch/pytorch/assets/7495155/ecf0adc4-9b0b-4aef-b0bc-72d4b17c45fa) ![image](https://github.com/pytorch/pytorch/assets/7495155/01bf5d5b-8450-4599-8c9a-00204ab56119) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117882 Approved by: https://github.com/cpuhrsch, https://github.com/malfet	2024-01-27 02:37:23 +00:00
Min Si	838d3620cd	[NCCL PG] log NCCL comm at creation and abort (#118335 ) Summary: It helps correlate NCCL PG with corresponding NCCL comm in separate logs. Differential Revision: D53107647 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118335 Approved by: https://github.com/wconstab	2024-01-27 01:43:53 +00:00
rzou	b256b7b348	Add way to actually delete a torch.library.Library object (#118318 ) Relying on object lifetimes in Python is a bad idea due to reference cycles. Previously, when a torch.library.Library object was destroyed, it cleared all the registrations associated with it, but it was unclear when it actually got destroyed due to the existence of refcycles. This PR: - adds torch::Library::clear(), which deterministically releases all of the RAII registration handles of the torch::Library object - adds a new `torch.library._scoped_library` context manager, which creates a library and cleans it up at the end of the scope using the previous item. All tests (unless they already handle library lifetimes) should use this new API - rewrites some flaky tests to use `_scoped_library`. In the future we'll probably migrate all of our torch.library tests to use `_scoped_library`, but that's kind of annoying because we have multiple thousands of LOC. I'm hoping this will deflake those tests; we'll see. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118318 Approved by: https://github.com/albanD	2024-01-26 22:30:51 +00:00
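The idea of pairing an explicit `clear()` with a scoped context manager can be shown without PyTorch. The following is a minimal sketch of the pattern only: `ToyLibrary`, `REGISTRY`, and `scoped_library` are hypothetical stand-ins and do not reproduce the real `torch.library` implementation.

```python
from contextlib import contextmanager

# Hypothetical global registry standing in for the dispatcher's registrations.
REGISTRY = {}


class ToyLibrary:
    """Toy stand-in for a library object that owns registrations."""

    def __init__(self, namespace):
        self.namespace = namespace
        self._keys = []

    def define(self, name, fn):
        key = f"{self.namespace}::{name}"
        REGISTRY[key] = fn
        self._keys.append(key)

    def clear(self):
        # Deterministic cleanup: does not wait for garbage collection.
        for key in self._keys:
            REGISTRY.pop(key, None)
        self._keys.clear()


@contextmanager
def scoped_library(namespace):
    """Create a library and clear it when the scope ends, even on error."""
    lib = ToyLibrary(namespace)
    try:
        yield lib
    finally:
        lib.clear()


with scoped_library("toy") as lib:
    lib.define("add_one", lambda x: x + 1)
    inside = REGISTRY["toy::add_one"](1)

registered_after = "toy::add_one" in REGISTRY
```

Because cleanup happens in the `finally` clause, the registration is gone as soon as the `with` block exits, regardless of any reference cycles that keep the object itself alive.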
Thiago Crepaldi	939008a268	Fix RuntimeError: NYI: Named tensors are not supported with the tracer (#118393 ) This PR relands #108238 that was closed as stale due to CLA issues and also because the CI check has marked the PR as not mergeable. Repro 1: ```python import torch def f(x): return x[x > 0] jf = torch.jit.trace(f, torch.tensor(2., device="cuda")) ``` Error: ```bash Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/opt/pytorch/torch/jit/_trace.py", line 874, in trace traced = torch._C._create_function_from_trace( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "<stdin>", line 2, in f RuntimeError: NYI: Named tensors are not supported with the tracer ``` Repro2: ```python import torch import torch.nn.functional as F from torch import nn import copy class Net(nn.Module): def __init__(self): super().__init__() def forward(self, inputs): x = copy.deepcopy(inputs) # RuntimeError: NYI: Named tensors are not supported with the tracer x = F.relu(x) return x model = Net() images = torch.randn(8, 28, 28) torch.jit.trace(model, images) ``` Error 2: ```bash Traceback (most recent call last): File "/opt/pytorch/test_deepcopy.py", line 18, in <module> File "/opt/pytorch/torch/jit/_trace.py", line 806, in trace return trace_module( ^^^^^^^^^^^^^ File "/opt/pytorch/torch/jit/_trace.py", line 1074, in trace_module module._c._create_method_from_trace( File "/opt/pytorch/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl return self._call_impl(args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/pytorch/torch/nn/modules/module.py", line 1520, in _call_impl return forward_call(args, *kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/pytorch/torch/nn/modules/module.py", line 1501, in _slow_forward result = self.forward(input, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/pytorch/test_deepcopy.py", line 12, in forward x = F.relu(x) ^^^^^^^^^^ File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy y = copier(memo) 
^^^^^^^^^^^^ File "/opt/pytorch/torch/_tensor.py", line 122, in __deepcopy__ new_storage = self._typed_storage()._deepcopy(memo) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/pytorch/torch/storage.py", line 847, in _deepcopy return self._new_wrapped_storage(copy.deepcopy(self._untyped_storage, memo)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/conda/envs/ptca/lib/python3.11/copy.py", line 153, in deepcopy y = copier(memo) ^^^^^^^^^^^^ File "/opt/pytorch/torch/storage.py", line 112, in __deepcopy__ new_storage = self.clone() ^^^^^^^^^^^^ File "/opt/pytorch/torch/storage.py", line 126, in clone return type(self)(self.nbytes(), device=self.device).copy_(self) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: NYI: Named tensors are not supported with the tracer ``` ---- #48054 RuntimeError: NYI: Named tensors are not supported with the tracer #49538 jit tracer doesn't work with unflatten layer #31591 when i try to export a pytorch model to ONNX, got RuntimeError: output of traced region did not have observable data dependence with trace inputs; this probably indicates your program cannot be understood by the tracer. - This bug was closed but exists. Multiple comments on it still showing error. 
This is addressed. Likely fixes the following issues (but untested): #63297 Named tensor in tracer; #2323 [Bug] torch.onnx.errors.UnsupportedOperatorError when convert mask2former to onnx. Fix zero dimensioned tensors when used with jit.trace: they are currently assigned an empty set for names, {}, which is not the same as "no name", so jit.trace bails with "NYI: Named tensors are not supported with the tracer". This happens when I am trying to save a non-trivial model as onnx, but the simplest repro I have seen is 48054 above, which has been added as test/jit/test_zero_dim_tensor_trace.py. Test plan: new unit test added; broken scenarios tested locally; CI. Fixes #48054 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118393 Approved by: https://github.com/zou3519	2024-01-26 19:31:23 +00:00
cyy	6da0e7f84b	[Clang-tidy header][17/N] Apply clang-tidy on headers in torch/csrc/cuda (#117829 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117829 Approved by: https://github.com/albanD	2024-01-26 13:33:24 +00:00
Bin Bao	4e456fd95b	[AOTI] Support scalar to tensor in the ABI-compatible mode (#118024 ) Differential Revision: [D53019485](https://our.internmc.facebook.com/intern/diff/D53019485) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118024 Approved by: https://github.com/ezyang	2024-01-26 03:15:05 +00:00
Jason Ansel	2de24c11f6	[inductor] Slightly faster memory allocation on CUDA (#118255 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118255 Approved by: https://github.com/peterbell10 ghstack dependencies: #118065, #118070, #118171	2024-01-25 20:49:14 +00:00
Jason Ansel	817debeb89	[inductor] Slightly faster memory allocation on CPU (#118171 ) Based on `python benchmarks/dynamo/microbenchmarks/overheads.py`: - Before `12.2us` - After `10.5us` This is inspired by `a2c17a2b00` -- but in Python rather than C++ Pull Request resolved: https://github.com/pytorch/pytorch/pull/118171 Approved by: https://github.com/jgong5, https://github.com/peterbell10 ghstack dependencies: #118065, #118070	2024-01-25 16:54:57 +00:00
Bin Bao	ee1dbb2acf	[AOTI] Fix a None as index codegen issue (#118187 ) Summary: Fix an ABI-compatible codegen issue when index_put has None in its indices. Differential Revision: [D53047489](https://our.internmc.facebook.com/intern/diff/D53047489) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118187 Approved by: https://github.com/chenyang78 ghstack dependencies: #118168, #118169	2024-01-25 11:53:44 +00:00
Bin Bao	d1e661a1ce	[AOTI] Add _scaled_dot_product_efficient_attention to C shim (#118169 ) Summary: _scaled_dot_product_efficient_attention is used in some TIMM models Differential Revision: [D53032358](https://our.internmc.facebook.com/intern/diff/D53032358) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118169 Approved by: https://github.com/chenyang78 ghstack dependencies: #118168	2024-01-25 11:53:44 +00:00
Bin Bao	5c7a18c5cb	[AOTI] Refactor shim_common.cpp (#118168 ) Summary: Use new_tensor_handle to reduce code repetition Differential Revision: [D53032353](https://our.internmc.facebook.com/intern/diff/D53032353) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118168 Approved by: https://github.com/chenyang78	2024-01-25 11:53:29 +00:00
Will Constable	a40951defd	[C10D] Fix nccl flightrecorder ignored dump timeout (#118142 ) Don't call future.get() unless it's ready, because it waits. Also, refactor the code a bit for simplicity. We should do a follow-on PR to clean up the timeouts further, but this should fix the glaring timeout bug. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118142 Approved by: https://github.com/shuqiangzhang ghstack dependencies: #118044, #118046, #118047	2024-01-25 04:25:36 +00:00
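The "don't call `future.get()` unless it's ready" idea in the flight-recorder fix above can be illustrated with a small Python analogy. This is not the C++ code from the PR: `result_if_ready` is a hypothetical helper built on the standard `concurrent.futures` module, and it only shows why checking readiness first avoids an unbounded wait.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor


def result_if_ready(future: Future, default=None):
    """Return the future's result only if it has already completed.

    Hypothetical helper: calling future.result() without a timeout would
    block until completion, which is the behavior the fix avoids.
    """
    if future.done():
        return future.result()
    return default


gate = threading.Event()
with ThreadPoolExecutor(max_workers=1) as pool:
    # The task blocks until the gate is set, so it cannot be done yet.
    pending = pool.submit(gate.wait)
    before = result_if_ready(pending, default="not ready")

    # Release the task, then wait for it with an explicit upper bound.
    gate.set()
    after = pending.result(timeout=5)

print(before, after)
```

The non-blocking check returns the default while the task is still running; the blocking call is made only once completion is expected, and even then with a timeout.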
cyy	87335fabae	[Exception] [6/N] Remove use of torch::TypeError (#117964 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117964 Approved by: https://github.com/albanD	2024-01-25 03:35:58 +00:00
soulitzer	67300a11cb	Support custom autograd Function forward AD return non-Tensor in forward (#118234 ) Fixes https://github.com/pytorch/pytorch/issues/117491 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118234 Approved by: https://github.com/albanD ghstack dependencies: #117552	2024-01-25 03:24:29 +00:00
soulitzer	5b819d9ef0	Properly move retains_grad hook on in-place over view for base (#117552 ) Fixes https://github.com/pytorch/pytorch/issues/117366 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117552 Approved by: https://github.com/albanD	2024-01-25 00:27:13 +00:00
drisspg	4e29f01bf2	Remove sdp_kernel and replace with sdpa_kernel in attention namespace (#114689 ) # Summary Simplification of Backend Selection This PR deprecates the `torch.backends.cuda.sdp_kernel` context manager and replaces it with a new context manager, `torch.nn.attention.sdpa_kernel`. The new context manager also changes the API. With `sdp_kernel`, one would specify the backend choice by negating the kernels one did not want to run. That context manager was only meant to be a debugging tool: "turn off the math backend" and see if you can run one of the fused implementations. Problems: - This pattern makes sense if the majority of users don't care to know anything about the backends that can be run. However, if users are seeking to use this context manager, then they are explicitly trying to run a specific backend. - This is not scalable. We are working on adding the cuDNN backend, and this API means that more implementations will need to be turned off if a user wants to explicitly run a given backend. - Discoverability of the current context manager. It is somewhat unintuitive that this backend manager is in backends/cuda/init when it now also controls the CPU fused kernel behavior. I think centralizing to the attention namespace will be helpful. Other concerns: - Typically backends (kernels) for operators are entirely hidden from users and are implementation details of the framework. We have exposed this to users already, albeit not by default and with beta warnings. Does making backend choices even more explicit lead to problems when we potentially want to remove existing backends (perhaps input shapes will get covered by newer backends)? A nice side effect is that, now that we aren't using the `BACKEND_MAP` in test_transformers, many, many dynamo failures are passing for CPU tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114689 Approved by: https://github.com/cpuhrsch	2024-01-24 22:28:04 +00:00
Bin Bao	821b2c543c	[AOTI] Support .item() in the ABI-compatible mode (#117989 ) Summary: Differential Revision: [D52965076](https://our.internmc.facebook.com/intern/diff/D52965076) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117989 Approved by: https://github.com/ezyang, https://github.com/chenyang78	2024-01-24 20:17:59 +00:00
dilililiwhy	b025e5984c	Get Device instance with correct type when privateuse1 backend is registered (#117966 ) Fixes #ISSUE_NUMBER If a privateuse1 backend is registered, let torch.device return the corresponding instance of Device when only an index is given. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117966 Approved by: https://github.com/albanD, https://github.com/malfet	2024-01-24 19:03:18 +00:00
Ke Wen	1e185c7803	[c10d] Barrier uses stream sync instead of device sync (#117804 ) Resubmitting #96785 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804 Approved by: https://github.com/wconstab	2024-01-24 18:42:14 +00:00
Mikayla Gawarecki	41a56f7828	Fix swap_tensors to swap PyObjects associated with TensorImpl (#116955 ) This PR intends to fix the following issue when swapping two tensors ```python >>> import torch >>> torch.manual_seed(5) >>> t1 = torch.randn(2) >>> t2 = torch.randn(3) >>> t1 tensor([-0.4868, -0.6038]) >>> t2 tensor([-0.5581, 0.6675, -0.1974]) >>> torch.utils.swap_tensors(t1, t2) >>> t1 tensor([-0.5581, 0.6675, -0.1974]) >>> t2 tensor([-0.4868, -0.6038]) >>> t1.fill_(0.5) # t1 back to its unswapped state :o tensor([-0.4868, -0.6038]) ``` What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned. `57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)` When we run any operation that returns the same TensorImpl (e.g. inplace op, `t.to(dtype=t.dtype)`, etc.), although `t1` now has `t2`'s TensorImpl, `t2`'s TensorImpl still has a reference to `t2`, so when we do the op on `t1` and `THPVariable_Wrap` attempts to return the pointer to the TensorImpl's PyObject, we return a pointer to `t2` instead. The TensorImpl should have the PyObjects in their PyObjectSlots swapped as well in `swap_tensors` Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955 Approved by: https://github.com/albanD	2024-01-24 01:40:18 +00:00
Nikita Shulga	bff348b28f	[AOTI] Add missing include to `model.h` (#118075 ) At least if one tries to compile the AOTI code on Darwin, compilation fails with an "implicit instantiation of undefined template" error: ``` In file included from /Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:3: /Users/nshulga/git/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model.h:69:21: error: implicit instantiation of undefined template 'std::basic_stringstream<char>' std::stringstream ss; ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/118075 Approved by: https://github.com/desertfire ghstack dependencies: #118074	2024-01-23 14:34:00 +00:00
Will Constable	455bba38f4	[C10D] Make Flight Recorder report time_created in ns (#118047 ) Addresses (6) from #117883 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118047 Approved by: https://github.com/zdevito ghstack dependencies: #118044, #118046	2024-01-23 08:18:08 +00:00
Will Constable	5df92a9244	[C10D] Add version tag to NCCL Flight Recorder Dump (#118046 ) Addresses (3) from #117883 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118046 Approved by: https://github.com/zdevito ghstack dependencies: #118044	2024-01-23 08:18:08 +00:00
Will Constable	dace1fda2e	[C10D] Make NCCL Flight Recorder dump produce a dict (#118044 ) Putting the list of entries into a particular key of a top-level dict paves the way for adding other metadata as other top level keys. Addresses 1 and 2 from #117883 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118044 Approved by: https://github.com/zdevito	2024-01-23 08:18:08 +00:00
Will Constable	6049998971	[C10D] Finer-grain nccl heartbeat, avoid false positive hangs (#118016 ) Summary: Previously, the heartbeat was incremented once per completed for loop over the list of in-progress work items, under the assumption that the processing would either be predictably quick or hang completely. In fact, there can be CUDA API contention that causes the processing of work items to slow down arbitrarily without truly deadlocking. To guard against this, we bump the heartbeat at the smallest unit of progress: one work item being successfully processed. Test Plan: CI Differential Revision: D52973948 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118016 Approved by: https://github.com/shuqiangzhang, https://github.com/kwen2501	2024-01-23 07:25:18 +00:00
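The difference between the two heartbeat granularities described above can be sketched in plain Python. This is a simplified illustration, not the watchdog code from the PR: the `Heartbeat` class, both `process_*` functions, and the work item names are hypothetical.

```python
from typing import Callable, Iterable


class Heartbeat:
    """Hypothetical progress counter that a monitor thread would poll."""

    def __init__(self) -> None:
        self.count = 0

    def bump(self) -> None:
        self.count += 1


def process_coarse(items: Iterable[str], handle: Callable[[str], None],
                   heartbeat: Heartbeat) -> None:
    # Old behavior: one bump per full pass over the work list, so a slow
    # (but not hung) pass looks like no progress at all.
    for item in items:
        handle(item)
    heartbeat.bump()


def process_fine(items: Iterable[str], handle: Callable[[str], None],
                 heartbeat: Heartbeat) -> None:
    # New behavior: one bump per successfully processed work item.
    for item in items:
        handle(item)
        heartbeat.bump()


items = ["allreduce", "broadcast", "allgather"]
coarse, fine = Heartbeat(), Heartbeat()
process_coarse(items, lambda _: None, coarse)
process_fine(items, lambda _: None, fine)
print(coarse.count, fine.count)
```

With per-item bumps, a monitor sees the counter advance as long as any single item completes, which is what lets it distinguish slow processing from a true hang.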
PyTorch MergeBot	b5799d9977	Revert "[c10d] Barrier uses stream sync instead of device sync (#117804 )" This reverts commit `0f6bbb1c07`. Reverted https://github.com/pytorch/pytorch/pull/117804 on behalf of https://github.com/clee2000 due to sorry the docs test failure is real, I think it wants the lines after the .. note to be indented https://github.com/pytorch/pytorch/actions/runs/7616827874/job/20745016788. Marking as nosignal due to bad Dr. CI categorization ([comment](https://github.com/pytorch/pytorch/pull/117804#issuecomment-1904882487))	2024-01-22 21:54:03 +00:00
Ke Wen	0f6bbb1c07	[c10d] Barrier uses stream sync instead of device sync (#117804 ) Resubmitting #96785 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804 Approved by: https://github.com/wconstab	2024-01-22 20:14:51 +00:00
Jeff Daily	01abb5af21	additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 ) Follow up to #107586. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214 Approved by: https://github.com/peterbell10, https://github.com/malfet	2024-01-22 18:33:41 +00:00
Guilherme Leobas	80cf0ce153	Enhance torch.vmap support from inside torch.compile (#116050 ) This work rewrites vmap support in torch.compile by inlining most of the frames into the existing FX graph. It also enables PyTorch to support features that were previously missing, such as keyword args. Fixes: https://github.com/pytorch/pytorch/issues/114306 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116050 Approved by: https://github.com/zou3519	2024-01-22 17:53:45 +00:00
cyy	39df084001	[Clang-tidy header][16/N] Enable clang-tidy on headers in torch/csrc/autograd (#117821 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117821 Approved by: https://github.com/Skylion007	2024-01-22 00:52:56 +00:00
eqy	8f7caaee67	[cuDNN] Fix cuDNN version parsing against future versions of cuDNN (#117908 ) Remove the unnecessary assumption of a fixed number of digits per field CC @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/117908 Approved by: https://github.com/cpuhrsch	2024-01-21 05:00:01 +00:00
fduwjj	05ef2030ea	[c10d] Add logs for NCCL Comm Abort call (#117868 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117868 Approved by: https://github.com/kwen2501	2024-01-20 21:34:13 +00:00
Wei Lu	a1b3b5748f	[Pytoch][Vulkan] Create context for conv1d (#117780 ) Summary: `conv1d` has two arguments `weight` and `bias` which are stored as constant tensors on the CPU and they are transferred to GPU at every inference call. We create a context for this operator to avoid the repeated passing. Specifically, we - created `Conv1dPackedContext`,`create_conv1d_context` and `run_layernorm_context` in `Convolution.h` and `Convolution.cpp` - registered them in `Register.cpp` - rewrote the graph representation of the op in `vulkan_rewrite.cpp` Test Plan: ## Numerical test ``` [luwei@82308.od /data/sandcastle/boxes/fbsource (8a8d911dc)]$ LD_LIBRARY_PATH=third-party/swiftshader/lib/linux-x64/ buck run fbcode/mode/dev-nosan //xplat/caffe2:pt_vulkan_api_test_bin -- --gtest_filter="conv1d" Buck UI: https://www.internalfb.com/buck2/7760800b-fd75-479a-9368-be5fcd5a7fef Network: Up: 0B Down: 0B Jobs completed: 4. Time elapsed: 0.6s. BUILD SUCCEEDED Running main() from third-party/googletest/1.14.0/googletest/googletest/src/gtest_main.cc Note: Google Test filter = conv1d [==========] Running 2 tests from 1 test suite. [----------] Global test environment set-up. [----------] 2 tests from VulkanAPITest [ RUN ] VulkanAPITest.conv1d_simple [ OK ] VulkanAPITest.conv1d_simple (159 ms) [ RUN ] VulkanAPITest.conv1d [ OK ] VulkanAPITest.conv1d (57 ms) [----------] 2 tests from VulkanAPITest (217 ms total) [----------] Global test environment tear-down [==========] 2 tests from 1 test suite ran. (217 ms total) [ PASSED ] 2 tests. ``` Full test result in P1053644934, summary as below ``` [----------] 419 tests from VulkanAPITest (28080 ms total) [----------] Global test environment tear-down [==========] 419 tests from 1 test suite ran. (28080 ms total) [ PASSED ] 418 tests. 
[ SKIPPED ] 1 test, listed below: [ SKIPPED ] VulkanAPITest.querypool_flushed_shader_log ``` ## Graph representation comparison We created a model using `conv1d` and traced it as below ``` # Define a simple model that uses conv1d class MyModel(torch.nn.Module): def __init__(self): super(MyModel, self).__init__() self.conv1d = nn.Conv1d(16, 33, 3) def forward(self, x): return self.conv1d(x) # Create an instance of the model model = MyModel() # Create a dummy input tensor for tracing input_tensor = torch.randn(20, 16, 50) # Use torch.jit.trace to trace the model and generate a graph traced_model = torch.jit.trace(model, input_tensor) ``` Then we converted the traced model to Vulkan backend using `optimize_for_mobile` ``` from torch.utils import mobile_optimizer vulkan_model = mobile_optimizer.optimize_for_mobile( traced_model, backend="vulkan", preserved_methods=to_preserve ) ``` Next we can print the graph of the `vulkan_model` as `print(vk_model.graph)` - before this diff: `conv1d` was used ``` graph(%self.1 : __torch__.___torch_mangle_16.MyModel, %x : Tensor): %60 : Device = prim::Constant[value="cpu"]() %self.conv1d.bias : Float(33, strides=[1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]() %37 : bool = prim::Constant[value=0]() %36 : NoneType = prim::Constant() %59 : Device = prim::Constant[value="vulkan"]() %self.conv1d.weight : Float(33, 16, 3, strides=[48, 3, 1], requires_grad=0, device=cpu) = prim::Constant[value=<Tensor>]() %7 : int = prim::Constant[value=1](), scope: __module.conv1d # /mnt/xarfuse/uid-23453/243f3953-seed-nspid4026532834_cgpid7972545-ns-4026532831/torch/nn/modules/conv.py:306:0 %18 : int[] = prim::Constant[value=[1]]() %19 : int[] = prim::Constant[value=[0]]() %39 : Tensor = aten::to(%x, %59, %36, %37, %37) %20 : Tensor = aten::conv1d(%39, %self.conv1d.weight, %self.conv1d.bias, %18, %19, %18, %7) %58 : Tensor = aten::to(%20, %60, %36, %37, %37) return (%58) ``` - after this diff: `conv1d` was replaced with 
`run_conv1d_context` ``` graph(%self.1 : __torch__.___torch_mangle_6.MyModel, %x : Tensor): %85 : Device = prim::Constant[value="cpu"]() %51 : bool = prim::Constant[value=0]() %50 : NoneType = prim::Constant() %84 : Device = prim::Constant[value="vulkan"]() %53 : Tensor = aten::to(%x, %84, %50, %51, %51) %prepack_folding_forward._jit_pass_packed_weight_0 : __torch__.torch.classes.vulkan.Conv1dPackedContext = prim::GetAttr[name="prepack_folding_forward._jit_pass_packed_weight_0"](%self.1) %22 : Tensor = vulkan_prepack::run_conv1d_context(%53, %prepack_folding_forward._jit_pass_packed_weight_0) %83 : Tensor = aten::to(%22, %85, %50, %51, %51) return (%83) ``` Differential Revision: D52865379 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117780 Approved by: https://github.com/yipjustin	2024-01-20 02:35:32 +00:00
Scott Wolchok	ad3d41692e	[PyTorch] return `decltype(auto)` from getItem (#117569 ) This allows getItem to take advantage of the nicer (sometimes-const-reference) return type from `List::get() const` added in the previous diff. Differential Revision: [D52809097](https://our.internmc.facebook.com/intern/diff/D52809097/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117569 Approved by: https://github.com/iseeyuan, https://github.com/malfet ghstack dependencies: #117568	2024-01-19 21:04:53 +00:00
PyTorch MergeBot	b637fdc8b3	Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 )" This reverts commit `74e1362499`. Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))	2024-01-19 17:35:04 +00:00
dilililiwhy	924ed91612	Move getDurationFromFirstEvent to USE_C10D_NCCL ifdef (#117738 ) Fixes #117517 Move the NCCL-related function getDurationFromFirstEvent under the USE_C10D_NCCL ifdef (related to https://github.com/pytorch/pytorch/issues/114575) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117738 Approved by: https://github.com/wconstab, https://github.com/XilunWu	2024-01-19 04:28:47 +00:00
cyy	38d9b3d937	Remove use of math_compat.h (#116167 ) Because ANDROID>=21 is assumed in CI tests, it is time to remove old workarounds. math_compat.h contains solely wrapper math functions for ANDROID, so we can remove its usage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116167 Approved by: https://github.com/ezyang	2024-01-19 03:37:55 +00:00
cyy	5c17f66a3d	[Exception] [5/N] Remove torch::IndexError (#117713 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/117713 Approved by: https://github.com/ezyang	2024-01-19 03:36:15 +00:00
Ke Wen	c16e6e4cf7	[ProcessGroup] Make watchdog check work queue more frequently (#117297 ) Today the watchdog's sleep interval is 1 s. That's a bit long compared to the speed of a modern GPU link (or network link). Take DDP and Ampere for example: DDP's bucket size = 25 MB; Ampere's NVLink speed = 250 GB/s; 25 MB / 250 GB/s = 100 ms. So we are updating the interval to 100 ms. Update: the correct result is 25 MB / 250 GB/s = 0.1 ms, but let's see how the 100 ms interval goes before making the checking more aggressive. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117297 Approved by: https://github.com/fduwjj	2024-01-19 02:33:31 +00:00
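The corrected figure in the message above can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 MB = 10^6 bytes, 1 GB = 10^9 bytes); the function name is illustrative and not part of PyTorch.

```python
def transfer_time_ms(size_mb: float, bandwidth_gb_per_s: float) -> float:
    """Time in milliseconds to move size_mb megabytes over a link.

    Assumes decimal units: 1 MB = 1e6 bytes, 1 GB = 1e9 bytes.
    """
    size_bytes = size_mb * 1e6
    bytes_per_second = bandwidth_gb_per_s * 1e9
    return size_bytes / bytes_per_second * 1e3


# DDP bucket size and NVLink speed quoted in the commit message.
bucket_ms = transfer_time_ms(25, 250)
print(f"{bucket_ms:.3f} ms")
```

Under these assumptions a single bucket crosses the link in about 0.1 ms, so the 100 ms polling interval is still roughly three orders of magnitude longer than one transfer.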
Jeff Daily	74e1362499	additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 ) Follow up to #107586. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214 Approved by: https://github.com/peterbell10	2024-01-19 00:50:18 +00:00
PyTorch MergeBot	2f84a9d37c	Revert "[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 )" This reverts commit `5aa92b5090`. Reverted https://github.com/pytorch/pytorch/pull/115663 on behalf of https://github.com/PaliC due to Unfortunately, this pr breaks cuda builds internally ([comment](https://github.com/pytorch/pytorch/pull/115663#issuecomment-1899388813))	2024-01-18 23:40:30 +00:00
Jason Ansel	a669319450	[inductor] Faster C++ kernel python bindings (#117500 ) Calling C++ from Python via ctypes is notoriously slow. This switches to generating our own C++ bindings directly, which is a >5x speedup on this kernel-launch-bound microbenchmark: ```python from ctypes import c_void_p import torch from torch import empty from torch._inductor.codecache import AsyncCompile from torch._dynamo.testing import rand_strided from torch._inductor.utils import print_performance from torch._inductor.wrapper_benchmark import compiled_module_main async_compile = AsyncCompile() src = ''' #include "/tmp/torchinductor_jansel/gb/cgbau5vlj6cetmcjbjbtw6x4rrivaln6f45s5d72gy2bfx5foz3k.h" extern "C" void kernel(const float* in_ptr0, float* out_ptr0) { { auto tmp0 = in_ptr0[static_cast<long>(0L)]; auto tmp1 = static_cast<float>(1.0); auto tmp2 = decltype(tmp0)(tmp0 + tmp1); out_ptr0[static_cast<long>(0L)] = tmp2; } } ''' cpp_fused_add_ctypes = async_compile.cpp(src) cpp_fused_add_cpython = async_compile.cpp_pybinding(["const float", "float"], src) async_compile.wait(globals()) del async_compile def call(arg0_1): buf0 = empty((1,), device='cpu', dtype=torch.float32) if use_ctypes: for _ in range(100): cpp_fused_add_ctypes(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr())) else: for _ in range(100): cpp_fused_add_cpython(arg0_1, buf0) del arg0_1 return (buf0,) def benchmark_compiled_module(times=1000, repeat=100): arg0_1 = rand_strided((1,), (1,), device='cpu', dtype=torch.float32) return print_performance(lambda: call(arg0_1), times=times, repeat=repeat) print("old ctypes bindings: ", end='') use_ctypes = True compiled_module_main('None', benchmark_compiled_module) print("new bindings: ", end='') use_ctypes = False compiled_module_main('None', benchmark_compiled_module) ``` Output: ``` old ctypes bindings: 0.000073 new bindings: 0.000013 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117500 Approved by: https://github.com/desertfire	2024-01-18 16:20:12 +00:00
Bin Bao	26956980c6	[AOTI] Add torch._export.aot_load (#117610 ) Summary: Add a torch._export.aot_load API that can load an AOTInductor-compiled model.so into a python executable. Test Plan: CI Differential Revision: D52825456 Pull Request resolved: https://github.com/pytorch/pytorch/pull/117610 Approved by: https://github.com/angelayi, https://github.com/khabinov, https://github.com/chenyang78	2024-01-18 15:02:16 +00:00
Tobias Ringwald	bc9cb04822	Replaced CHECK with TORCH_CHECK in order to not abort, but throw a Ru… (#117653 ) …ntimeError instead. Fixes #117499. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117653 Approved by: https://github.com/antoniojkim, https://github.com/JackCaoG, https://github.com/alanwaketan	2024-01-18 07:47:22 +00:00
Ke Wen	6d96beb6be	[c10d] Remove health check (#117699 ) https://github.com/pytorch/pytorch/pull/114916 and https://github.com/pytorch/pytorch/pull/116222 added support for eager NCCL comm init (performed as soon as `init_process_group` is called). Users who care about the time difference and want to see NCCL init errors early can use eager init now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117699 Approved by: https://github.com/wconstab	2024-01-18 02:14:49 +00:00


13670 Commits