pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	bcb4f7c172	Revert "Grouped Query Attention (#128898 )" This reverts commit `6b28af1b79`. Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/ZainRizvi due to Sorry, this broke a bunch of tests internally. See D60638265 ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2265961038))	2024-08-02 18:58:46 +00:00
Xuehai Pan	672ce4610e	Populate submodules of `torch._C` to `sys.modules` recursively (#132216 ) See comment: `e9d1c26275/torch/__init__.py (L938-L950)` This PR recursively sets the submodules in the C extension to `sys.modules` (e.g., `_C._dynamo.eval_frame`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/132216 Approved by: https://github.com/ezyang	2024-08-01 12:04:59 +00:00
Syed Tousif Ahmed	7c89ec0f7c	Implements torch.cuda.MemPool() API (#131152 ) In this PR: - Pool id creation logic is refactored and moved to a MemPool class. `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change. - MemPool holds a pointer to a CUDAAllocator as proposed in https://github.com/pytorch/pytorch/issues/124807#issuecomment-2077506997. Tests are added to show usage with CUDAPluggableAllocator. - MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in CUDACachingAllocator to route allocations to a user provided allocator. See draft here: https://github.com/pytorch/pytorch/pull/125722/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/131152 Approved by: https://github.com/eqy, https://github.com/ezyang	2024-08-01 01:29:30 +00:00
jainapurva	6b28af1b79	Grouped Query Attention (#128898 )
### Approach: Using the current function declaration
Constraint: Q_Heads % KV_Heads == 0
Major change:
- Added a new argument enable_gqa: bool to the sdpa function call
- It gives meaning to the third-to-last dimension (the number of heads).

Sample use cases this would enable: LLama3
```
# LLama3 8b call to SDPA
import torch
from torch.nn.functional import scaled_dot_product_attention

batch, seq_len_q, seq_len_kv, D = 2, 16, 16, 128  # illustrative sizes, not from the PR
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)
output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)
# Output Shape (batch, 32, seq_len_q, D)
```
### Design Choice:
- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their numbers of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.
### Benchmarks:
- sdpa.py: #130634 For different batch sizes enable_gqa=True shows a substantial improvement in the run_time of sdpa

| batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True | forward_time when enable_gqa=False |
| ---------- | ----------- | ------------ | --------- | ---------- | --------- | --------------------------------- | ---------------------------------- |
| 1 | 32 | 8 | 2048 | 2048 | 2048 | 100.71 | 119.70 |
| 8 | 32 | 8 | 2048 | 2048 | 2048 | 539.78 | 628.83 |
| 16 | 32 | 8 | 2048 | 2048 | 2048 | 1056.81 | 1225.48 |
| 32 | 32 | 8 | 2048 | 2048 | 2048 | 2099.54 | 2440.45 |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)
- TorchTitan: https://github.com/pytorch/torchtitan/pull/458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898 Approved by: https://github.com/drisspg	2024-07-31 22:58:51 +00:00
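The head-matching rule described in the Grouped Query Attention commit can be illustrated without PyTorch. The sketch below is a minimal illustration, not the PR's implementation: `repeat_interleave` here is a plain-Python imitation of the 1-D behaviour of `torch.repeat_interleave`, and `kv_head_for_query_head` is a hypothetical helper. It shows that repeating each key/value head `Q_Heads // KV_Heads` times in a row makes query head `h` read key/value head `h // (Q_Heads // KV_Heads)`.

```python
def repeat_interleave(items, repeats):
    """Repeat each element `repeats` times consecutively (plain-Python
    imitation of torch.repeat_interleave on a 1-D sequence)."""
    return [item for item in items for _ in range(repeats)]


def kv_head_for_query_head(q_heads, kv_heads):
    """Return a list whose entry h is the KV head used by query head h.

    Hypothetical helper for illustration; enforces Q_Heads % KV_Heads == 0.
    """
    if q_heads % kv_heads != 0:
        raise ValueError("q_heads must be divisible by kv_heads")
    group_size = q_heads // kv_heads
    return repeat_interleave(list(range(kv_heads)), group_size)


# Llama-3-8B-like head counts from the commit message: 32 query heads, 8 KV heads.
mapping = kv_head_for_query_head(32, 8)
print(mapping[:8])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With 32 query heads and 8 key/value heads, each key/value head is shared by a group of 4 consecutive query heads.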
Luca Wehrstedt	f4f7aba75d	Expose function to probe whether PyTorch was built with FlashAttention (#131894 ) This is needed by downstream projects (e.g., xFormers) to determine whether they can count on FlashAttention in PyTorch or whether they need to build it themselves. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131894 Approved by: https://github.com/drisspg, https://github.com/eqy	2024-07-31 11:33:09 +00:00
Matthew Hoffman	deb788f6cc	Merge `torch.nn.utils.rnn` type stubs (#131872 ) I want to re-attempt: * #61467 See: * https://github.com/pytorch/pytorch/issues/10536#issuecomment-2251948730 and this is one of the files I would touch. quoting @ezyang: * https://github.com/pytorch/pytorch/issues/91648#issuecomment-1372010129 > The back story here is that in https://github.com/pytorch/pytorch/pull/19089 we added pyi stubs for nn modules, but when we got off Python 2 we started merging the pyi stubs directly into the py files, e.g., as in https://github.com/pytorch/pytorch/pull/43044. But not all the modules got the treatment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131872 Approved by: https://github.com/Skylion007, https://github.com/ezyang	2024-07-31 02:24:59 +00:00
Shreyans Pathak	12b67bd998	Fix pyi annotation for `ProcessGroupGloo.Options` (#132080 ) This PR fixes the pyi annotation for `ProcessGroupGloo.Options` based on the definition in the `torch/csrc/distributed/c10d/init.cpp` file. Fixes #132054 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132080 Approved by: https://github.com/Skylion007	2024-07-30 13:52:31 +00:00
PyTorch MergeBot	499ead96ff	Revert "Grouped Query Attention (#128898 )" This reverts commit `d039b14207`. Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/albanD due to Broken test on main ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2258314481))	2024-07-30 13:11:24 +00:00
jainapurva	d039b14207	Grouped Query Attention (#128898 )
### Approach: Using the current function declaration
Constraint: Q_Heads % KV_Heads == 0
Major change:
- Added a new argument enable_gqa: bool to the sdpa function call
- It gives meaning to the third-to-last dimension (the number of heads).

Sample use cases this would enable: LLama3
```
# LLama3 8b call to SDPA
import torch
from torch.nn.functional import scaled_dot_product_attention

batch, seq_len_q, seq_len_kv, D = 2, 16, 16, 128  # illustrative sizes, not from the PR
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)
output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)
# Output Shape (batch, 32, seq_len_q, D)
```
### Design Choice:
- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their numbers of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.
### Benchmarks:
- sdpa.py: #130634 For different batch sizes enable_gqa=True shows a substantial improvement in the run_time of sdpa

| batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True | forward_time when enable_gqa=False |
| ---------- | ----------- | ------------ | --------- | ---------- | --------- | --------------------------------- | ---------------------------------- |
| 1 | 32 | 8 | 2048 | 2048 | 2048 | 100.71 | 119.70 |
| 8 | 32 | 8 | 2048 | 2048 | 2048 | 539.78 | 628.83 |
| 16 | 32 | 8 | 2048 | 2048 | 2048 | 1056.81 | 1225.48 |
| 32 | 32 | 8 | 2048 | 2048 | 2048 | 2099.54 | 2440.45 |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)
- TorchTitan: https://github.com/pytorch/torchtitan/pull/458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898 Approved by: https://github.com/drisspg	2024-07-29 21:49:06 +00:00
Simon Mahns	dcb03106b7	[Land Internally] MTIA equivalent of torch.cuda.memory_stats (#132007 ) Summary: as title Test Plan: pytorch ci failing: https://github.com/pytorch/pytorch/issues/131962 Differential Revision: D60335413 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132007 Approved by: https://github.com/hanzlfs, https://github.com/egienvalue	2024-07-29 20:47:18 +00:00
Edward Z. Yang	6c6fbb4691	Fix pyi annotation for ProcessGroupNCCL.Options (#130957 ) Probably all the other options need updating too, but this is the one I needed. The accurate annotation was determined by reading torch/csrc/distributed/c10d/init.cpp Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/130957 Approved by: https://github.com/wconstab, https://github.com/fduwjj	2024-07-29 17:46:01 +00:00
PyTorch MergeBot	e191b83462	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit `709ddf7a9d`. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))	2024-07-26 18:08:20 +00:00
Boyuan Feng	40cc5c0697	[AOT Autograd] Donated Buffer (#130580 ) Implements donated buffer feature and adds unit tests. Donated buffer is a saved tensor that is not aliased with forward inputs, fw_outputs (except saved tensors), and bw_outputs. We detect donated buffers during `aot_dispatch_autograd` and store donated buffers in `ViewAndMutationMetadata`, such that they can be accessed in inductor. Fixes #129496 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130580 Approved by: https://github.com/bdhirsh	2024-07-26 17:14:34 +00:00
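The aliasing rule stated in the Donated Buffer commit can be sketched in plain Python. This is only an illustration under stated assumptions: tensors are modelled as name-to-storage-id mappings, and `find_donated_buffers` is a hypothetical helper, not the detection code that runs inside `aot_dispatch_autograd`.

```python
def find_donated_buffers(saved, fw_inputs, fw_outputs, bw_outputs):
    """Return names of saved tensors whose storage is not shared with any
    forward input, user-visible forward output, or backward output.

    Every argument is a dict mapping a tensor name to a storage id
    (a hypothetical stand-in for real alias analysis).
    """
    excluded = (
        set(fw_inputs.values())
        | set(fw_outputs.values())
        | set(bw_outputs.values())
    )
    return [name for name, storage in saved.items() if storage not in excluded]


# Made-up example: "x_view" aliases the input and "out_alias" aliases the output,
# so only the two remaining saved tensors qualify as donated buffers.
saved = {"relu_out": 10, "x_view": 1, "mask": 11, "out_alias": 20}
fw_inputs = {"x": 1}
fw_outputs = {"y": 20}
bw_outputs = {"grad_x": 30}
donated = find_donated_buffers(saved, fw_inputs, fw_outputs, bw_outputs)
print(donated)  # ['relu_out', 'mask']
```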
Boyuan Feng	16d7cb5049	[CUDAGraph] Type annotation for cudagraph_trees.py (#131621 ) As a Better Engineer effort, this PR adds type annotation to `cudagraph_trees.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621 Approved by: https://github.com/eellison	2024-07-26 06:14:06 +00:00
PyTorch MergeBot	03f49c9523	Revert "[CUDAGraph] Type annotation for cudagraph_trees.py (#131621 )" This reverts commit `16699c7d84`. Reverted https://github.com/pytorch/pytorch/pull/131621 on behalf of https://github.com/atalman due to lint is failing, please rebase fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/131621#issuecomment-2251831163))	2024-07-26 02:08:45 +00:00
Boyuan Feng	16699c7d84	[CUDAGraph] Type annotation for cudagraph_trees.py (#131621 ) As a Better Engineer effort, this PR adds type annotation to `cudagraph_trees.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621 Approved by: https://github.com/eellison	2024-07-26 01:40:23 +00:00
PyTorch MergeBot	b343644f3a	Revert "MTIA equivalent of torch.cuda.memory_stats (#131673 )" This reverts commit `513ce5f69a`. Reverted https://github.com/pytorch/pytorch/pull/131673 on behalf of https://github.com/clee2000 due to linked internal diff has internal changes, not sure what happened here, but this shouldn't have been merged externally without also merging the internal diff ([comment](https://github.com/pytorch/pytorch/pull/131673#issuecomment-2251749644))	2024-07-26 00:54:37 +00:00
Mikayla Gawarecki	709ddf7a9d	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-25 22:23:38 +00:00
Simon Mahns	513ce5f69a	MTIA equivalent of torch.cuda.memory_stats (#131673 ) Summary: Adding MTIA equivalent of `torch.cuda.memory_stats` Pull Request resolved: https://github.com/pytorch/pytorch/pull/131673 Approved by: https://github.com/egienvalue	2024-07-25 21:59:59 +00:00
Animesh Jain	e2b941a1b4	[dynamo] Rename TENSOR_ALIASING to OBJECT_ALIASING. Permit OBJECT_ALIASING for dict guards (#131480 ) Fixes https://github.com/pytorch/pytorch/issues/129667 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131480 Approved by: https://github.com/williamwen42 ghstack dependencies: #131347, #131367, #131378, #131389, #131405	2024-07-24 00:06:53 +00:00
Yifu Wang	161c18ed0b	SymmetricMemory-based, low contention intra-node all-gather and reduce-scatter (#130583 )
```python
# NOTE [low-contention collectives]
# When a collective is overlapped with abundant compute, it makes sense to
# prioritize reducing the contention between the collective and the overlapped
# compute, even at the cost of a slightly slower collective.
#
# Common collective implementations (e.g., NCCL without user buffer
# registration) optimize for throughput with no ambient compute. However, such
# implementations may not be optimal when they are overlapped with compute:
# - These implementations typically fuse the entire collective into a single
# kernel and reserve SM resources based on the most demanding portion of the
# collective, even when a large portion of the collective does not require this
# much resource.
# - These implementations often use SM-based P2P copy as opposed to copy
# engine-based P2P copy. Copy engine-based P2P copy may not have a significant
# advantage when there's no ambient compute. However, it may significantly
# improve overall resource utilization in the presence of ambient compute.
#
# When overlapped with intensive compute (e.g., persistent matmul kernels), the
# SM-usage of a collective can lead to inefficient overlapping.
#
# Low-contention collectives achieve their goals with the following strategies:
# - Use copy engine-based copy whenever possible.
# - Break down portions of a collective with different resource requirements
# into multiple kernels. This improves the overlapping efficiency at the cost
# of additional launching overhead.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130583 Approved by: https://github.com/weifengpy	2024-07-23 23:37:48 +00:00
PyTorch MergeBot	e4b5645f83	Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633 )" This reverts commit `5b5e0698a5`. Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))	2024-07-23 17:19:34 +00:00
Oguz Ulgen	4ca8705035	Add mypy typing to fx node (#131434 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131434 Approved by: https://github.com/zou3519	2024-07-23 17:00:31 +00:00
Mikayla Gawarecki	5b5e0698a5	Add wrappers for synchronous GPUDirect Storage APIs (#130633 ) Based in part on https://github.com/NVIDIA/apex/pull/1774 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633 Approved by: https://github.com/albanD	2024-07-22 14:51:24 +00:00
Florian	1614891946	[Profiler] exclude gpu_user_annotation when accumulating cuda time total (#130733 ) Fixes #[130730](https://github.com/pytorch/pytorch/issues/130730) Pull Request resolved: https://github.com/pytorch/pytorch/pull/130733 Approved by: https://github.com/aaronenyeshi	2024-07-22 04:35:21 +00:00
Aaron Orenstein	b193894b94	FakeTensor cache SymInt support (#127596 ) Adds support for SymInts in the FakeTensor cache. A couple notes: 1. When a SymInt is present in the input key for a FakeTensor operation we cache on the ShapeEnv instead of using the FakeTensorMode cache. This is necessary so we don't have to remember and check the guards. It reduces the cache hits but there's diminishing return on how much work we can do before the cache becomes more of a burden than a gain. 2. We need to be careful that when we cache an output SymInt that is a direct copy from the input that when we have a cache-hit we copy the SymNode from the input to the output. This is important because the fx-graph building code actually uses SymNode ids in the process of building the graph so constructing a same-content-but-different-id SymNode will fail. 3. In the cache key we store SymInts as a _PySymInputStub. These represent SymInt (and friends) but support `__hash__` and `__eq__` (which SymInt do not). 4. In the cache entry we store SymInts as a _SymIntOutputStub. 
Perf example:
```
python benchmarks/dynamo/timm_models.py --ci --accuracy --timing --explain --inductor --dynamic-shapes --dynamic-batch-only --device cuda --training --amp --total-partitions 2 --partition-id 0 --output /tmp/training_timm_models.csv --filter crossvit_9_240
```
fake tensor cache before:
```
INFO: FakeTensor cache stats:
INFO:   cache_hits: 68137
INFO:   cache_misses: 837
INFO:   cache_bypasses:
INFO:     symbolic shape: 48224
INFO:     CompositeImplicitAutograd: 917
INFO:     non-fake tensor: 70
INFO:     non-FakeTensor output: 62
INFO:     non-builtin: 8
INFO:     dynamic output shape: 1
```
and after:
```
INFO: FakeTensor cache stats:
INFO:   cache_hits: 88187
INFO:   cache_misses: 14233
INFO:   cache_bypasses:
INFO:     CompositeImplicitAutograd: 1037
INFO:     non-FakeTensor output: 602
INFO:     non-fake tensor: 70
INFO:     unsafe view: 36
INFO:     non-builtin: 8
INFO:     dynamic output shape: 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127596 Approved by: https://github.com/eellison ghstack dependencies: #131014, #129780	2024-07-21 19:26:38 +00:00
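Note 3 of the FakeTensor cache commit (storing symbolic values in the cache key as stubs that support `__hash__` and `__eq__`) can be sketched in plain Python. Everything below is hypothetical and for illustration only: `FakeSymInt`, `InputStub`, and `make_key` are invented names, and keying the stub on an expression string is an assumption, not a description of how `_PySymInputStub` works.

```python
class FakeSymInt:
    """Hypothetical stand-in for a symbolic integer that cannot be hashed."""

    __hash__ = None  # instances are unhashable, so they cannot be dict keys directly

    def __init__(self, expr):
        self.expr = expr


class InputStub:
    """Hashable wrapper used only inside cache keys (hypothetical)."""

    def __init__(self, sym):
        self.expr = sym.expr

    def __hash__(self):
        return hash(self.expr)

    def __eq__(self, other):
        return isinstance(other, InputStub) and self.expr == other.expr


def make_key(op_name, args):
    """Build a hashable cache key, wrapping symbolic arguments in stubs."""
    wrapped = tuple(InputStub(a) if isinstance(a, FakeSymInt) else a for a in args)
    return (op_name, wrapped)


cache = {}
cache[make_key("aten.add", (FakeSymInt("s0"), 2))] = "cached result"

# A second call with an equal symbolic expression hits; a different one misses.
hit = cache.get(make_key("aten.add", (FakeSymInt("s0"), 2)))
miss = cache.get(make_key("aten.add", (FakeSymInt("s1"), 2)))
print(hit, miss)  # cached result None
```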
Wu, Chunyuan	a8319698b3	[inductor] [cpp] improve cache blocking with CPU info (#129348 )
## Description
For the single-thread case, this PR improves the cache blocking in the CPP GEMM template using CPU info (the L1 and L2 cache sizes). `Mc_blocks` and `Kc_blocks` are calculated based on the conditions below:
- size_of_B < L1
- size_of_A < 0.5 * L2

For multi-thread, we need to tune the task decomposition among threads together with cache blocking. We disabled the cache blocking change for now and will submit a follow-up PR for multi-thread optimizations.
## Performance
No regressions. Models with > 3% performance speedup are listed below:
### BF16 single thread (measured on CPU with AMX support)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

### FP32 single thread (measured on Ice Lake)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

### Next step
The E2E level improvement is limited for the reasons below:
- For several HF models, we can observe performance improvement at kernel level for the gemm template kernel, but since the performance is either still worse than the ATen kernel (thus won't be selected during autotune) or improved from worse than ATen to similar to ATen, we don't see E2E level performance change.
- There are models where the gemm template kernel could get > 10% performance improvement with this PR, but since the kernel time is only about 3% of the E2E time, we don't observe significant E2E level improvement.

We will continue to find possible optimizations in the gemm template kernel in follow-up PRs. Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/129348 Approved by: https://github.com/jgong5, https://github.com/jansel ghstack dependencies: #130675, #130690	2024-07-20 06:53:31 +00:00
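The two cache-blocking conditions in the commit above (size_of_B < L1 and size_of_A < 0.5 * L2) can be turned into a small worked calculation. The sketch below is not the inductor implementation. It assumes, purely for illustration, that "B" means a `Kc x Nr` slice of B, that "A" means an `Mc x Kc` slice of A, and that block counts are in units of register-block sizes `mr`, `nr`, `kr`; `pick_cache_blocks` and all of its parameters are hypothetical.

```python
def pick_cache_blocks(m_blocks, k_blocks, mr, nr, kr, itemsize, l1_bytes, l2_bytes):
    """Pick (Mc_blocks, Kc_blocks) so that, under this sketch's assumptions,
    a Kc x Nr slice of B stays below L1 and an Mc x Kc slice of A stays
    below half of L2."""
    b_bytes_per_k_block = kr * nr * itemsize
    # Largest kc with kc * b_bytes_per_k_block < l1_bytes, clamped to [1, k_blocks].
    kc_blocks = min(k_blocks, max(1, (l1_bytes - 1) // b_bytes_per_k_block))

    a_bytes_per_m_block = mr * kc_blocks * kr * itemsize
    # Largest mc with mc * a_bytes_per_m_block < l2_bytes / 2, clamped to [1, m_blocks].
    mc_blocks = min(m_blocks, max(1, (l2_bytes // 2 - 1) // a_bytes_per_m_block))
    return mc_blocks, kc_blocks


# Made-up machine: 48 KiB L1, 2 MiB L2, FP32 elements, 32x32x32 register blocks.
mc, kc = pick_cache_blocks(
    m_blocks=64, k_blocks=64, mr=32, nr=32, kr=32,
    itemsize=4, l1_bytes=48 * 1024, l2_bytes=2 * 1024 * 1024,
)
print(mc, kc)  # 23 11
```

In this example one K block of B takes 4096 bytes, so 11 blocks (45056 bytes) fit under the 49152-byte L1, and 23 M blocks of A (1036288 bytes) fit under half of the L2.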
PyTorch MergeBot	7c299b46ca	Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 )" This reverts commit `8390843eba`. Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/izaitsevfb due to breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2240516202))	2024-07-19 22:58:51 +00:00
Isuru Fernando	8390843eba	Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 ) Fixes #104435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264 Approved by: https://github.com/ezyang	2024-07-16 14:29:29 +00:00
PyTorch MergeBot	78799e82b0	Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 )" This reverts commit `1bc390c5f5`. Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/jithunnair-amd due to test test/inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_fallback_to_eager_if_recompiling_too_many_times is failing https://github.com/pytorch/pytorch/actions/runs/9933628108/job/27477785946 `1bc390c5f5`. Test was introduced by `fa5f572748` which is before the merge base ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2229508737))	2024-07-15 21:59:46 +00:00
Shuqiang Zhang	77fb5b0e23	[c10d] a new Pytorch API (split_group) to create a process group (#130507 ) This is the implementation following the RFC: https://github.com/pytorch/pytorch/issues/130407 ncclCommSplit Summary: In current Pytorch/c10d, the new_group API is used to create a new process group from the default pg. When device_id is specified in init_process_group and nccl is used as the backend, the new_group call will use ncclCommSplit to create the nccl communicators to save communicator resources. It has a few drawbacks: (1) Redundant calls. Suppose the default group has 256 ranks and we need 32 child PGs, each with 8 ranks. In this case, each rank needs to call new_group and ncclCommSplit 32 times because of how we implement the new_group API and the collective requirement of ncclCommSplit. For a specific global rank, 31 calls of ncclCommSplit would be no_color splits, and only 1 of them is a colored split. With the proposed new split_group API, we expect only 1 call of split_group/ncclCommSplit per rank in the above example. (2) new_group can only split from default_pg. Ideally, a new pg should be able to be split from any pg. With the new split_group API, users can create new PGs using ncclCommSplit with fewer calls and initialize the PG eagerly. This is also useful in the cases of creating many P2P communicators. Test Plan: New UTs: e.g., python test/distributed/test_c10d_nccl.py -k test_comm_split_group_larger_scale Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/130507 Approved by: https://github.com/wconstab	2024-07-15 21:26:43 +00:00
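The call-count arithmetic in the split_group commit above (256 ranks, 32 child groups of 8 ranks) can be checked with a short simulation. This is a sketch under stated assumptions: it does not call `torch.distributed`, `count_new_group_splits` is a hypothetical helper, and child groups are assumed to be contiguous rank ranges.

```python
def count_new_group_splits(world_size, ranks_per_child, rank):
    """Count, for one global rank, how many collective split calls are
    'colored' (the rank belongs to the child group) versus 'no_color'
    (the rank participates only because the call is collective)."""
    # Assumption: child groups are contiguous rank ranges.
    groups = [
        range(start, start + ranks_per_child)
        for start in range(0, world_size, ranks_per_child)
    ]
    colored = sum(1 for group in groups if rank in group)
    no_color = len(groups) - colored
    return colored, no_color


colored, no_color = count_new_group_splits(world_size=256, ranks_per_child=8, rank=17)
print(colored, no_color)  # 1 31
```

Under the new_group approach each rank therefore takes part in 32 calls, while the commit message expects a single split_group call per rank for the same layout.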
Isuru Fernando	1bc390c5f5	Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264 ) Fixes #104435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264 Approved by: https://github.com/ezyang	2024-07-15 04:16:17 +00:00
Shivam Raikundalia	6f275ae4d0	Add kwinputs to Kineto Traces (#130373 ) Summary: On the autograd side of things, we are currently saving the kwinputs but we aren't doing anything with them on the profiler side. This diff enables the use of the kwinputs for both FunctionEvents and Chrome Traces. Test Plan: Added unit testing for both chrome traces and FunctionEvents. Used RecordFunctionFast to test kwinputs since test already had kwargs being passed in but not tested. Differential Revision: D59472345 Pull Request resolved: https://github.com/pytorch/pytorch/pull/130373 Approved by: https://github.com/davidberard98	2024-07-14 00:40:59 +00:00
Aaron Orenstein	567482973d	typing fake_tensor.py (#128041 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128041 Approved by: https://github.com/eellison ghstack dependencies: #129182	2024-07-13 06:07:40 +00:00
Wang, Eikan	f52b2ee90f	Modularize aten parameter parser and checker (#125308 ) In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`. ```C++ using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>; ``` With this PR, we can extend support to other parameter types in a more modular way, like `string`, `int`, `double`. Differential Revision: [D59399546](https://our.internmc.facebook.com/intern/diff/D59399546) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308 Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/atalman	2024-07-11 13:17:25 +00:00
Ivan Zaitsev	6b3460ae0d	fix discrepancy from the export of #126601 (#130296 ) #126601 (internally [D58103182](https://www.internalfb.com/diff/D58103182)) was exported missing one class definition. This PR brings github repo in sync with fbcode. Pull Request resolved: https://github.com/pytorch/pytorch/pull/130296 Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet	2024-07-10 17:26:44 +00:00
Xuehai Pan	3f50e197c4	[BE] annotate `torch.autograd.graph` (#129558 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/129558 Approved by: https://github.com/soulitzer	2024-07-06 18:14:16 +00:00
Xuehai Pan	d1d0a7080f	[torchgen] reference generated comment to actual location of the generator and template (#130020 ) As per title.
```diff
# torch/_VF.pyi
- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```
```diff
# torch/return_types.pyi
- # @generated from torch/_C/return_types.pyi
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/return_types.pyi.in
```
```diff
# torch/_C/__init__.pyi
- # @generated from torch/_C/__init__.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/__init__.pyi.in
```
```diff
# torch/_C/_nn.pyi
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_nn.pyi.in
```
```diff
# torch/_C/_VariableFunctions.pyi
- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```
```diff
# torch/nn/functional.pyi
+ # @generated by tools/pyi/gen_pyi.py from torch/nn/functional.pyi.in
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130020 Approved by: https://github.com/ezyang	2024-07-05 21:47:14 +00:00
Ramana Cherukuri	f6a0be5023	Add warpSize to Device properties (#128449 ) Adding warp_size to CudaDeviceProperties.
>>> import torch
>>> prop = torch.cuda.get_device_properties(torch.cuda.current_device())
>>> prop.warp_size
64
>>>
@jeffdaily @pruthvistony @jithunnair-amd @ROCmSupport Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128449 Approved by: https://github.com/eqy, https://github.com/jataylo, https://github.com/jithunnair-amd, https://github.com/malfet	2024-07-01 09:13:32 +00:00
eqy	f845a7a91a	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-30 19:22:16 +00:00
Xuehai Pan	56935684c3	Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in `.pyi` stub files (#129419 ) ------ - [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`. - [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`. Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449: - #117449 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419 Approved by: https://github.com/ezyang ghstack dependencies: #129375, #129376	2024-06-29 09:23:39 +00:00
PyTorch MergeBot	83caf4960f	Revert "Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in `.pyi` stub files (#129419 )" This reverts commit `e40f50cb87`. Reverted https://github.com/pytorch/pytorch/pull/129419 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))	2024-06-29 00:44:24 +00:00
FEI	59e4e92556	sdp::SDPBackend::flash_attention support PrivateUse1 (#126392 ) Fixes https://github.com/pytorch/pytorch/issues/124271 cc @cpuhrsch @drisspg @albanD @soulitzer Pull Request resolved: https://github.com/pytorch/pytorch/pull/126392 Approved by: https://github.com/drisspg	2024-06-28 17:48:40 +00:00
Xuehai Pan	e40f50cb87	Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in `.pyi` stub files (#129419 ) ------ - [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`. - [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`. Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449: - #117449 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419 Approved by: https://github.com/ezyang ghstack dependencies: #129375, #129376	2024-06-28 15:37:57 +00:00
PyTorch MergeBot	999eec8dea	Revert "[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 )" This reverts commit `b7e7a4cb01`. Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some test_transformer running on internal A100 and V100 ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196202003))	2024-06-28 06:03:54 +00:00
Jeff Daily	169b4ca07e	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy, https://github.com/malfet	2024-06-27 23:53:13 +00:00
Chen, Zejun	a028e5862d	[profiler] Directly use end_ns to create the FunctionEvent instead of using start_ns + duration_ns in pytorch profiler post processing for checking parent-child precisely (#129554 ) Use the raw end_ns directly, instead of the sum of start_ns and duration_ns, in order to avoid negative CPU time in profiler. Fix https://github.com/pytorch/pytorch/issues/101861 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129554 Approved by: https://github.com/gujinghui, https://github.com/aaronenyeshi	2024-06-27 10:46:05 +00:00
Eddie Yan	b7e7a4cb01	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-26 00:49:18 +00:00
Yifu Wang	bbd47f7b2f	Remove ProcessGroupCudaP2P and change async-TP to use SymmetricMemory (#128762 ) This PR removes `ProcessGroupCudaP2P` and changes async-TP to use `SymmetricMemory`. The async-TP implementation is still workspace-based, but it now doesn't require a buffer size to be specified upfront. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128762 Approved by: https://github.com/wanchaol	2024-06-25 22:32:21 +00:00
Xuehai Pan	93a33bf3ac	[BE] update type annotations for basic utilities in `torch/__init__.py` (#129001 ) Changes: 1. Make some arguments positional-only as we only support Python 3.8+ 2. Clean up `torch.typename(obj)` implementation. 3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001 Approved by: https://github.com/malfet	2024-06-24 18:04:38 +00:00
PyTorch MergeBot	cb4919344a	Revert "[BE] update type annotations for basic utilities in `torch/__init__.py` (#129001 )" This reverts commit `e53d959028`. Reverted https://github.com/pytorch/pytorch/pull/129001 on behalf of https://github.com/XuehaiPan due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/129001#issuecomment-2186944549))	2024-06-24 16:18:43 +00:00
Xuehai Pan	e53d959028	[BE] update type annotations for basic utilities in `torch/__init__.py` (#129001 ) Changes: 1. Make some arguments positional-only as we only support Python 3.8+ 2. Clean up `torch.typename(obj)` implementation. 3. Update type annotations, especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001 Approved by: https://github.com/malfet	2024-06-24 14:35:41 +00:00
Yifu Wang	217aac96d7	Introduce a prototype for SymmetricMemory (#128582 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them. ### SymmetricMemory `SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for op-level custom communication patterns (via the get_buffer APIs and the synchronization primitives), as well as custom communication kernels (via the buffer and signal_pad device pointers). ### Python API Example ```python from torch._C._distributed_c10d import _SymmetricMemory # Set a store for rendezvousing symmetric allocations on a group of devices # identified by group_name. The concept of groups is logical; users can # utilize predefined groups (e.g., a group of devices identified by a # ProcessGroup) or create custom ones. Note that SymmetricMemoryAllocator # backends might employ a more efficient communication channel for the actual # rendezvous process and only use the store for bootstrapping purposes. _SymmetricMemory.set_group_info(group_name, rank, world_size, store) # Identical to empty_strided, but allows symmetric memory access to be # established for the allocated tensor via _SymmetricMemory.rendezvous(). # This function itself is not a collective operation. t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name) # Users can write Python custom ops that leverage the symmetric memory access. # Below are examples of things users can do (assuming the group's world_size is 2). 
# Establishes symmetric memory access on tensors allocated via # _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process, # and the mapping between a local memory region and the associated SymmetricMemory # object is unique. Subsequent calls to rendezvous() with the same tensor will receive # the cached SymmetricMemory object. # # The function has collective semantics and must be invoked simultaneously # from all rendezvous participants. symm_mem = _SymmetricMemory.rendezvous(t) # This represents the allocation on rank 0 and is accessible from all devices. buf = symm_mem.get_buffer(0, (64, 64), torch.float32) if symm_mem.rank == 0: symm_mem.wait_signal(src_rank=1) assert buf.eq(42).all() else: # The remote buffer can be used as a regular tensor buf.fill_(42) symm_mem.put_signal(dst_rank=0) symm_mem.barrier() if symm_mem.rank == 0: symm_mem.barrier() assert buf.eq(43).all() else: new_val = torch.empty_like(buf) new_val.fill_(43) # Contiguous copies to/from a remote buffer utilize copy engines # which bypass SMs (i.e. no need to load the data into registers) buf.copy_(new_val) symm_mem.barrier() ``` ### Custom CUDA Comm Kernels Given a tensor, users can access the associated `SymmetricMemory`, which provides the pointers to remote buffers/signal_pads needed for custom communication kernels. ```cpp TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory( const at::Tensor& tensor); class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target { public: ... virtual std::vector<void*> get_buffer_ptrs() = 0; virtual std::vector<void*> get_signal_pad_ptrs() = 0; virtual void** get_buffer_ptrs_dev() = 0; virtual void** get_signal_pad_ptrs_dev() = 0; virtual size_t get_buffer_size() = 0; virtual size_t get_signal_pad_size() = 0; virtual int get_rank() = 0; virtual int get_world_size() = 0; ... }; ``` ### Limitations of IntraNodeComm and ProcessGroupCudaP2p `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. 
This approach: - Leads to awkward UX in which the required workspace needs to be specified upfront. - Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather). - Prevents torch.compile from eliminating all copies. In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels. * __->__ #128582 Differential Revision: [D58849033](https://our.internmc.facebook.com/intern/diff/D58849033) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582 Approved by: https://github.com/wanchaol	2024-06-21 08:49:11 +00:00
Jiong Gong	914d3ca2ba	[inductor][cpp] BF16 AMX micro-gemm support (#127195 ) This PR adds the intrinsics based micro-gemm for BF16 using Advanced Matrix eXtension (AMX) instructions available in Intel 4th and 5th Xeon processors. A compilation check is added to `codecache.py` to check the validity of the compiler support. Also, since AMX requires an initialization in the Linux kernel to extra register states, an initialization function is added to do that and triggered via `codecache.py`. Performance speedups with >=10% on BF16 AMP, max_autotune vs. no autotune, measured on Intel(R) Xeon(R) Platinum 8488C: Static shapes Single-threaded \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|---------\| \| timm_models \| mixer_b16_224 \| 1.54 \| \| timm_models \| convit_base \| 1.53 \| \| huggingface \| MobileBertForQuestionAnswering \| 1.52 \| \| torchbench \| fastNLP_Bert \| 1.44 \| \| torchbench \| llama \| 1.33 \| \| timm_models \| swin_base_patch4_window7_224 \| 1.31 \| \| torchbench \| dlrm \| 1.28 \| \| torchbench \| timm_vision_transformer_large \| 1.28 \| \| huggingface \| MobileBertForMaskedLM \| 1.27 \| \| timm_models \| vit_base_patch16_224 \| 1.26 \| \| timm_models \| beit_base_patch16_224 \| 1.23 \| \| timm_models \| jx_nest_base \| 1.21 \| \| torchbench \| pyhpc_equation_of_state \| 1.18 \| \| huggingface \| Speech2Text2ForCausalLM \| 1.15 \| \| timm_models \| pit_b_224 \| 1.14 \| \| timm_models \| twins_pcpvt_base \| 1.14 \| \| torchbench \| maml_omniglot \| 1.1 \| \| timm_models \| botnet26t_256 \| 1.1 \| Multi-threaded \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|---------\| \| torchbench \| BERT_pytorch \| 1.35 \| \| torchbench \| lennard_jones \| 2.43 \| \| torchbench \| hf_Albert \| 1.35 \| \| torchbench \| hf_T5 \| 1.34 \| \| torchbench \| soft_actor_critic \| 1.34 \| \| torchbench \| fastNLP_Bert \| 1.28 \| \| huggingface \| LayoutLMForSequenceClassification \| 1.26 \| \| torchbench \| llama 
\| 1.24 \| \| huggingface \| GPT2ForSequenceClassification \| 1.19 \| \| torchbench \| hf_Bart \| 1.17 \| \| torchbench \| hf_Bert_large \| 1.16 \| \| torchbench \| hf_GPT2 \| 1.16 \| \| timm_models \| gmixer_24_224 \| 1.16 \| \| torchbench \| hf_GPT2_large \| 1.15 \| \| torchbench \| maml_omniglot \| 1.14 \| \| torchbench \| hf_Bert \| 1.13 \| \| torchbench \| hf_DistilBert \| 1.13 \| \| torchbench \| hf_T5_large \| 1.12 \| \| huggingface \| MT5ForConditionalGeneration \| 1.11 \| Dynamic shapes Single-threaded \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|-------\| \| timm_models \| mixer_b16_224 \| 1.52 \| \| timm_models \| convit_base \| 1.5 \| \| huggingface \| MobileBertForQuestionAnswering \| 1.49 \| \| torchbench \| fastNLP_Bert \| 1.42 \| \| torchbench \| timm_vision_transformer_large \| 1.28 \| \| timm_models \| swin_base_patch4_window7_224 \| 1.27 \| \| torchbench \| llama \| 1.26 \| \| huggingface \| MobileBertForMaskedLM \| 1.25 \| \| timm_models \| vit_base_patch16_224 \| 1.25 \| \| timm_models \| beit_base_patch16_224 \| 1.24 \| \| timm_models \| jx_nest_base \| 1.2 \| \| torchbench \| dlrm \| 1.19 \| \| timm_models \| pit_b_224 \| 1.13 \| \| timm_models \| twins_pcpvt_base \| 1.13 \| \| torchbench \| hf_Bert_large \| 1.12 \| \| torchbench \| hf_BigBird \| 1.11 \| \| huggingface \| Speech2Text2ForCausalLM \| 1.11 \| \| timm_models \| eca_botnext26ts_256 \| 1.11 \| \| timm_models \| botnet26t_256 \| 1.1 \| Multi-threaded \| Model Family \| Model Name \| Speedup \| \|--------------\|------------\|-------\| \| torchbench \| BERT_pytorch \| 1.18 \| \| torchbench \| lennard_jones \| 2.18 \| \| torchbench \| hf_Albert \| 1.37 \| \| torchbench \| soft_actor_critic \| 1.31 \| \| huggingface \| GPT2ForSequenceClassification \| 1.29 \| \| torchbench \| hf_T5 \| 1.28 \| \| torchbench \| fastNLP_Bert \| 1.27 \| \| torchbench \| hf_Bart \| 1.21 \| \| torchbench \| hf_Bert_large \| 1.19 \| \| torchbench \| hf_T5_large \| 1.19 \| \| 
torchbench \| hf_Bert \| 1.16 \| \| torchbench \| hf_GPT2 \| 1.16 \| \| huggingface \| CamemBert \| 1.16 \| \| torchbench \| hf_GPT2_large \| 1.13 \| \| torchbench \| functorch_maml_omniglot \| 1.12 \| \| huggingface \| BertForMaskedLM \| 1.12 \| \| huggingface \| MT5ForConditionalGeneration \| 1.12 \| \| torchbench \| hf_DistilBert \| 1.11 \| \| timm_models \| mixnet_l \| 1.11 \| \| timm_models \| tf_mixnet_l \| 1.11 \| No perf regressions. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127195 Approved by: https://github.com/jansel	2024-06-21 07:21:47 +00:00
cyy	5c676bb8b3	Remove Caffe2 handling from onnx_unpack_quantized_weights (#129021 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/129021 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-06-21 06:16:44 +00:00
Deng Weishi	b542825066	Enable deterministic support for oneDNN (#127277 ) This PR is a part of RFC https://github.com/pytorch/pytorch/issues/114848. In response to a request for the Torchbenchmark models, this PR enables the deterministic attribute for the oneDNN operators on XPU backends, such as convolution, deconvolution, and matmul. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127277 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/gujinghui	2024-06-21 05:21:24 +00:00
PyTorch MergeBot	63a724d8e1	Revert "Introduce a prototype for SymmetricMemory (#128582 )" This reverts commit `8771e3429c`. Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2181656181))	2024-06-20 22:31:29 +00:00
PyTorch MergeBot	e84cf805d2	Revert "Modularize aten parameter parser and checker (#125308 )" This reverts commit `60bbdc0b40`. Reverted https://github.com/pytorch/pytorch/pull/125308 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125308#issuecomment-2181327211))	2024-06-20 18:52:05 +00:00
Yifu Wang	8771e3429c	Introduce a prototype for SymmetricMemory (#128582 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them. ### SymmetricMemory `SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for op-level custom communication patterns (via the get_buffer APIs and the synchronization primitives), as well as custom communication kernels (via the buffer and signal_pad device pointers). ### Python API Example ```python from torch._C._distributed_c10d import _SymmetricMemory # Set a store for rendezvousing symmetric allocations on a group of devices # identified by group_name. The concept of groups is logical; users can # utilize predefined groups (e.g., a group of devices identified by a # ProcessGroup) or create custom ones. Note that SymmetricMemoryAllocator # backends might employ a more efficient communication channel for the actual # rendezvous process and only use the store for bootstrapping purposes. _SymmetricMemory.set_group_info(group_name, rank, world_size, store) # Identical to empty_strided, but allows symmetric memory access to be # established for the allocated tensor via _SymmetricMemory.rendezvous(). # This function itself is not a collective operation. t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name) # Users can write Python custom ops that leverage the symmetric memory access. # Below are examples of things users can do (assuming the group's world_size is 2). 
# Establishes symmetric memory access on tensors allocated via # _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process, # and the mapping between a local memory region and the associated SymmetricMemory # object is unique. Subsequent calls to rendezvous() with the same tensor will receive # the cached SymmetricMemory object. # # The function has collective semantics and must be invoked simultaneously # from all rendezvous participants. symm_mem = _SymmetricMemory.rendezvous(t) # This represents the allocation on rank 0 and is accessible from all devices. buf = symm_mem.get_buffer(0, (64, 64), torch.float32) if symm_mem.rank == 0: symm_mem.wait_signal(src_rank=1) assert buf.eq(42).all() else: # The remote buffer can be used as a regular tensor buf.fill_(42) symm_mem.put_signal(dst_rank=0) symm_mem.barrier() if symm_mem.rank == 0: symm_mem.barrier() assert buf.eq(43).all() else: new_val = torch.empty_like(buf) new_val.fill_(43) # Contiguous copies to/from a remote buffer utilize copy engines # which bypass SMs (i.e. no need to load the data into registers) buf.copy_(new_val) symm_mem.barrier() ``` ### Custom CUDA Comm Kernels Given a tensor, users can access the associated `SymmetricMemory`, which provides the pointers to remote buffers/signal_pads needed for custom communication kernels. ```cpp TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory( const at::Tensor& tensor); class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target { public: ... virtual std::vector<void*> get_buffer_ptrs() = 0; virtual std::vector<void*> get_signal_pad_ptrs() = 0; virtual void** get_buffer_ptrs_dev() = 0; virtual void** get_signal_pad_ptrs_dev() = 0; virtual size_t get_buffer_size() = 0; virtual size_t get_signal_pad_size() = 0; virtual int get_rank() = 0; virtual int get_world_size() = 0; ... }; ``` ### Limitations of IntraNodeComm and ProcessGroupCudaP2p `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. 
This approach: - Leads to awkward UX in which the required workspace needs to be specified upfront. - Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather). - Prevents torch.compile from eliminating all copies. In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels. * __->__ #128582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582 Approved by: https://github.com/wanchaol	2024-06-19 03:38:58 +00:00
PyTorch MergeBot	77830d509f	Revert "Introduce a prototype for SymmetricMemory (#128582 )" This reverts commit `7a39755da2`. Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2176685232))	2024-06-18 18:11:43 +00:00
cyy	163847b1bb	[1/N] [Caffe2] Remove caffe2_aten_fallback code (#128675 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/128675 Approved by: https://github.com/r-barnes	2024-06-17 21:25:59 +00:00
Yifu Wang	7a39755da2	Introduce a prototype for SymmetricMemory (#128582 ) Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them. ### SymmetricMemory `SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for op-level custom communication patterns (via the get_buffer APIs and the synchronization primitives), as well as custom communication kernels (via the buffer and signal_pad device pointers). ### Python API Example ```python from torch._C._distributed_c10d import _SymmetricMemory # Set a store for rendezvousing symmetric allocations on a group of devices # identified by group_name. The concept of groups is logical; users can # utilize predefined groups (e.g., a group of devices identified by a # ProcessGroup) or create custom ones. Note that SymmetricMemoryAllocator # backends might employ a more efficient communication channel for the actual # rendezvous process and only use the store for bootstrapping purposes. _SymmetricMemory.set_group_info(group_name, rank, world_size, store) # Identical to empty_strided, but allows symmetric memory access to be # established for the allocated tensor via _SymmetricMemory.rendezvous(). # This function itself is not a collective operation. t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name) # Users can write Python custom ops that leverage the symmetric memory access. # Below are examples of things users can do (assuming the group's world_size is 2). 
# Establishes symmetric memory access on tensors allocated via # _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process, # and the mapping between a local memory region and the associated SymmetricMemory # object is unique. Subsequent calls to rendezvous() with the same tensor will receive # the cached SymmetricMemory object. # # The function has collective semantics and must be invoked simultaneously # from all rendezvous participants. symm_mem = _SymmetricMemory.rendezvous(t) # This represents the allocation on rank 0 and is accessible from all devices. buf = symm_mem.get_buffer(0, (64, 64), torch.float32) if symm_mem.rank == 0: symm_mem.wait_signal(src_rank=1) assert buf.eq(42).all() else: # The remote buffer can be used as a regular tensor buf.fill_(42) symm_mem.put_signal(dst_rank=0) symm_mem.barrier() if symm_mem.rank == 0: symm_mem.barrier() assert buf.eq(43).all() else: new_val = torch.empty_like(buf) new_val.fill_(43) # Contiguous copies to/from a remote buffer utilize copy engines # which bypass SMs (i.e. no need to load the data into registers) buf.copy_(new_val) symm_mem.barrier() ``` ### Custom CUDA Comm Kernels Given a tensor, users can access the associated `SymmetricMemory`, which provides the pointers to remote buffers/signal_pads needed for custom communication kernels. ```cpp TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory( const at::Tensor& tensor); class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target { public: ... virtual std::vector<void*> get_buffer_ptrs() = 0; virtual std::vector<void*> get_signal_pad_ptrs() = 0; virtual void** get_buffer_ptrs_dev() = 0; virtual void** get_signal_pad_ptrs_dev() = 0; virtual size_t get_buffer_size() = 0; virtual size_t get_signal_pad_size() = 0; virtual int get_rank() = 0; virtual int get_world_size() = 0; ... }; ``` ### Limitations of IntraNodeComm and ProcessGroupCudaP2p `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. 
This approach: - Leads to awkward UX in which the required workspace needs to be specified upfront. - Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather). - Prevents torch.compile from eliminating all copies. In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels. * __->__ #128582 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582 Approved by: https://github.com/wanchaol	2024-06-15 10:20:21 +00:00
Wang, Eikan	60bbdc0b40	Modularize aten parameter parser and checker (#125308 ) In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure intends to be used to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`. ```C++ using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>; ``` With this PR, we can extend other parameter-type support in a more modularize way, like `string`, `int`, `double`, and other different types to be summarized as the following list. The list is collected from all aten operations and ordered by the number of being used. - `Tensor` - `bool` - `int64_t` - `TensorList` - `Scalar` - `c10::SymIntArrayRef` - `::std::optional<Tensor>` - `IntArrayRef` - `double` - `c10::SymInt` - `::std::optional<ScalarType>` - `::std::optional<double>` - `::std::optional<bool>` - `::std::optional<Layout>` - `::std::optional<Device>` - `::std::optional<int64_t>` - `Dimname` - `::std::optional<Generator>` - `c10::string_view` - `::std::optional<c10::string_view>` - `OptionalIntArrayRef` - `::std::optional<Scalar>` - `OptionalSymIntArrayRef` - `::std::optional<MemoryFormat>` - `::std::optional<c10::SymInt>` - `ScalarType` - `ArrayRef<Scalar>` - `DimnameList` - `::std::optional<ArrayRef<double>>` - `::std::array<bool,3>` - `::std::optional<DimnameList>` - `c10::List<::std::optional<Tensor>>` - `::std::array<bool,2>` - `Storage` - `::std::array<bool,4>` - `Device` - `DeviceIndex` - `ITensorListRef` - `Stream` - `Layout` - `MemoryFormat` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-06-15 09:18:44 +00:00
Simon Fan	4b96575a09	[dynamo][aot autograd] Silently disable default saved tensor hooks during tracing (#123196 ) FIXES #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently nop default saved tensor hooks, in order to support user-code that use torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched. For compiled autograd, we're firing pack hooks once and unpack hooks twice right now, I'll look into this separately from this issue. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196 Approved by: https://github.com/soulitzer	2024-06-14 20:28:08 +00:00
Tristan Rice	7c370d2fb0	expose set_thread_name to Python and set thread names (#128448 ) This adds a new multiprocessing method `_set_thread_name` and calls it from torchelastic and dataloader main functions. This will allow better monitoring of processes as we can separate elastic and dataloading processes from the main training process. Threads named: * torchrun/elastic * PyTorch dataloader worker processes + pin memory thread * TCPStore * ProcessGroupNCCL background threads * WorkerServer httpserver thread Test plan: ``` $ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL \| grep pt_' 3264281 3264281 pts/45 00:00:02 pt_elastic 3264281 3267950 pts/45 00:00:00 pt_elastic ``` dataloading ```py import torch import time from torch.utils.data import ( DataLoader, Dataset, ) class NoopDataset(Dataset): def __getitem__(self, index): return index def __len__(self): return 10 dataloader = DataLoader(NoopDataset(), num_workers=2) for i, x in enumerate(dataloader): print(i, x) time.sleep(10000) ``` ``` $ python3 ~/scripts/dataloader_test.py $ ps -eL \| grep pt_ 1228312 1228312 pts/45 00:00:02 pt_main_thread 1228312 1230058 pts/45 00:00:00 pt_main_thread 1228312 1230059 pts/45 00:00:00 pt_main_thread 1230052 1230052 pts/45 00:00:00 pt_data_worker 1230052 1230198 pts/45 00:00:00 pt_data_worker 1230052 1230740 pts/45 00:00:00 pt_data_worker 1230055 1230055 pts/45 00:00:00 pt_data_worker 1230055 1230296 pts/45 00:00:00 pt_data_worker 1230055 1230759 pts/45 00:00:00 pt_data_worker ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448 Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro	2024-06-13 16:38:23 +00:00
PyTorch MergeBot	817ce6835b	Revert "[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 )" This reverts commit `4c971932e8`. Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2163690162))	2024-06-12 18:47:52 +00:00
Xu Han	9e39c62908	correct avx512_vnni isa name. (#128318 ) `x86` currently has two VNNI ISAs: `avx2_vnni` and `avx512_vnni`. This PR corrects the function name to `avx512_vnni`. Co-authored-by: Jiong Gong <jiong.gong@intel.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128318 Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire	2024-06-12 16:12:49 +00:00
Kulin Seth	8df56afc20	Add support in Python API for the recommended max working set size. (#128289 ) Adds ways for users to request recommended max size for Metal on Mac. It plumbs through https://developer.apple.com/documentation/metal/mtldevice/2369280-recommendedmaxworkingsetsize?language=objc Can be used like ``` max_memory = torch.mps.recommended_max_memory() print("Recommended Max Memory : ", (max_memory/(1024*1024*1024)), "GB") ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/128289 Approved by: https://github.com/malfet	2024-06-12 16:03:57 +00:00
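The usage snippet in this commit converts a byte count to gigabytes by dividing by 1024 three times. A minimal sketch of that conversion as a reusable helper follows; `bytes_to_gib` and `format_recommended_max` are hypothetical names introduced here, and the helper takes a plain integer so it runs without torch or an MPS device.

```python
def bytes_to_gib(num_bytes: int) -> float:
    # 1 GiB = 1024 * 1024 * 1024 bytes, the divisor used in the commit's snippet.
    return num_bytes / (1024 * 1024 * 1024)


def format_recommended_max(num_bytes: int) -> str:
    # Mirrors the message printed in the commit's example, with two decimals.
    return f"Recommended Max Memory : {bytes_to_gib(num_bytes):.2f} GB"


# Stand-in value: 16 GiB expressed in bytes.
print(format_recommended_max(16 * 1024**3))
```

On a Mac with MPS available, the value returned by `torch.mps.recommended_max_memory()` could be passed to `format_recommended_max` in place of the stand-in value.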
Aaron Orenstein	3c971d2ef3	Flip default value for mypy disallow_untyped_defs [final] (#127836 ) Not requiring all functions to have types allows a lot of 'Any' types to slip in - which poison types and make mypy unable to properly typecheck the code. I want to flip the default so that new files are required to have fully typed defs and we can have a burndown list of files that fail to require full types. The preceding stack of PRs (cut up simply to limit the number of file changes per PR "reasonable") adds `# mypy: allow-untyped-defs` to any file which didn't immediately pass mypy with the flag flipped. Due to changing files and merge conflicts it will probably be necessary to have several passes through before landing this final PR which turns the option on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836 Approved by: https://github.com/oulgen, https://github.com/Skylion007	2024-06-12 15:28:42 +00:00
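To make the effect of the flag concrete, here is a small sketch of a file that opts out using the `# mypy: allow-untyped-defs` comment mentioned above, alongside the fully typed form that new files would need. The function names are hypothetical; the behavior described in the comments is what mypy's `disallow_untyped_defs` option is documented to enforce, and nothing here is copied from the PyTorch tree.

```python
# mypy: allow-untyped-defs
# The comment above is the per-file opt-out named in the commit message. With it
# removed and disallow_untyped_defs enabled, mypy would flag `scale_untyped`.


def scale_untyped(values, factor):
    # No annotations: parameters and return value are implicitly `Any`,
    # so mypy cannot catch a caller passing, say, a string factor.
    return [v * factor for v in values]


def scale_typed(values: list[float], factor: float) -> list[float]:
    # Fully annotated: the form required once the default is flipped.
    return [v * factor for v in values]
```

Both functions behave identically at runtime; the difference is only in what a type checker can verify about their callers.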
PyTorch MergeBot	c9c1fed065	Revert "Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374 )" This reverts commit `c13e03c874`. Reverted https://github.com/pytorch/pytorch/pull/128374 on behalf of https://github.com/clee2000 due to sorry I need to revert this in order to revert something else, to remerge, just rebase and fix the merge conflict ([comment](https://github.com/pytorch/pytorch/pull/128374#issuecomment-2161772864))	2024-06-11 23:34:03 +00:00
Aaron Orenstein	c13e03c874	Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128374 Approved by: https://github.com/Skylion007	2024-06-11 15:58:28 +00:00
Oguz Ulgen	5b5d269d34	Speed up fx graph iteration by implementing it in C++ (#128288 ) Before this change ``` python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py iterating over 100000000 FX nodes took 19.5s (5132266 nodes/s) ``` After this change ``` python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py iterating over 100000000 FX nodes took 3.4s (29114001 nodes/s) ``` 5.7x improvement Differential Revision: [D58343997](https://our.internmc.facebook.com/intern/diff/D58343997) Pull Request resolved: https://github.com/pytorch/pytorch/pull/128288 Approved by: https://github.com/jansel, https://github.com/albanD	2024-06-11 05:48:31 +00:00
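The quoted "5.7x improvement" can be checked from the numbers in the benchmark output. The short calculation below uses only the figures reported in the commit message; the variable names are introduced here for illustration.

```python
# Figures copied from the benchmark output quoted in the commit message.
before_seconds = 19.5
after_seconds = 3.4
before_rate = 5_132_266   # nodes/s before the change
after_rate = 29_114_001   # nodes/s after the change

# Speedup computed two ways: from wall-clock times and from throughput.
speedup_from_times = before_seconds / after_seconds
speedup_from_rates = after_rate / before_rate

print(f"from times: {speedup_from_times:.2f}x, from rates: {speedup_from_rates:.2f}x")
```

The two ratios differ slightly because the reported times are rounded to one decimal place, but both round to the 5.7x stated in the message.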
eqy	4c971932e8	[cuDNN][SDPA] Remove `TORCH_CUDNN_SDPA_ENABLED=1`, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343 ) Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes. What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here... Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343 Approved by: https://github.com/Skylion007	2024-06-09 06:53:34 +00:00
Aaron Orenstein	dcfa7702c3	Flip default value for mypy disallow_untyped_defs [1/11] (#127838 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838 Approved by: https://github.com/oulgen	2024-06-08 18:16:33 +00:00
Xu Han	ba81c3c290	[inductor] add cpp builder code. (take 2) (#125849 ) This is a fully manual rebase of the code from PR https://github.com/pytorch/pytorch/pull/124045. The old PR appears to have broken down after too many commits and rebases; see https://github.com/pytorch/pytorch/pull/124045#issuecomment-2103744588 ------- This is the first step of RFC https://github.com/pytorch/pytorch/issues/124245. Changes: 1. Add cpp builder code; the new cpp_builder supports Windows. 2. Add a CPU ISA checker that is cross-OS and exported from backend cpuinfo. 3. Switch the compiler ISA checker to the new cpp builder. 4. Make CppCodeCache use the new ISA checker. 5. Add a temporary `test_new_cpp_build_logical` UT to help with the transition to the new code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125849 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-06-07 20:49:58 +00:00
Shuqiang Zhang	30788739f4	[c10d] add a simple test to demonstrate the user usage of collectives (#127665 ) Summary: While experimenting with the UT, it seemed useful to add a simple example of a user function that works with different subclasses of _ControlCollectives, and to test that the user function executes correctly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127665 Approved by: https://github.com/d4l3k	2024-06-05 04:32:11 +00:00
Mikayla Gawarecki	a135776307	Remove tensor subclass detection logic from weights_only unpickler (#127808 ) Remove logic to auto-detect and allow subclasses that did not override certain methods from the weights_only unpickler from https://github.com/pytorch/pytorch/pull/124331 for 2.4 release Subclasses should be loadable using `torch.serialization.add_safe_globals` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127808 Approved by: https://github.com/malfet	2024-06-05 02:14:30 +00:00
Tristan Rice	597922ba21	Reapply "distributed debug handlers (#126601 )" (#127805 ) This reverts commit `7646825c3e`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805 Approved by: https://github.com/PaliC	2024-06-04 19:44:30 +00:00
Shan19900305	3bcc3cddb5	Using scalarType instead string in function _group_tensors_by_device_and_dtype. (#127869 ) torch.dtype can now pass through pybind11, so `_group_tensors_by_device_and_dtype` is modified to use the scalar type directly, with no conversion between torch.dtype and string on the Python and C++ sides. @ezyang @bdhirsh Pull Request resolved: https://github.com/pytorch/pytorch/pull/127869 Approved by: https://github.com/ezyang	2024-06-04 18:19:33 +00:00
Jeff Daily	0e7bd7fedd	[ROCm] TunableOp improvements (#124362 ) - use less memory; smaller default hipblaslt workspace size - options to avoid cache effects - icache flush option - rotating buffers during tuning - python APIs - unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124362 Approved by: https://github.com/xw285cornell	2024-06-03 22:30:11 +00:00
PyTorch MergeBot	7646825c3e	Revert "distributed debug handlers (#126601 )" This reverts commit `3d541835d5`. Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987))	2024-05-31 01:21:24 +00:00
Tristan Rice	3d541835d5	distributed debug handlers (#126601 ) This adds debug handlers as described in: * https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy) * https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy) This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR. This adds 2 handlers out of the box: * `/handler/ping` for testing purposes * `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder Test plan: ``` python test/distributed/elastic/test_control_plane.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601 Approved by: https://github.com/kurman, https://github.com/c-p-i-o	2024-05-30 02:21:08 +00:00
Xuehai Pan	ba3b05fdf3	[1/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort stdlib (#127122 ) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122 Approved by: https://github.com/kit1980	2024-05-25 08:25:50 +00:00
Yifu Wang	4a09117d16	Introduce ProcessGroupCudaP2P (#122163 ) ## Context This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers. The stack contains several components: - `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining. - `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops. - Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops. To enable the prototype feature: - Set the distributed backend to `cuda_p2p`. - Set `torch._inductor.config._micro_pipeline_tp` to `True`. NOTE: the prototype sets nothing in stone w.r.t to each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved. ## Benchmark Setup: - 8 x H100 (500W) + 3rd gen NVSwitch. - Llama3 8B training w/ torchtitan. - 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose. 
Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0 <img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1"> Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn <img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2"> ## This PR `ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA. `ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it. Usage: ``` # Using ProcessGroupCudaP2P dist.init_process_group(backend="cuda_p2p", ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options pg_options = ProcessGroupCudaP2P.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options pg_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) 
# Using ProcessGroupCudaP2P while specifying both # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options pg_options = ProcessGroupCudaP2P.Options() pg_options.nccl_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Down-casting the backend to access p2p buffers for cuda_p2p specific # optimizations if is_cuda_p2p_group(group): backend = get_cuda_p2p_backend(group) if required_p2p_buffer_size > backend.get_buffer_size(): # fallback p2p_buffer = backend.get_p2p_buffer(...) else: # fallback ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163 Approved by: https://github.com/wanchaol	2024-05-24 18:33:18 +00:00
Aaron Orenstein	70dc59c55f	Fix perf regression caused by #122074 (#126996 ) The original change was about 9.5% slower than before #122074 . This improves it to be only about 1.4% slower. Also touched up some unrelated nits that the linter complained about. Fixes #126293 Ran torchbench 3 times on each change. Perf values before (stable), after (fix), and with #122074 backed out (backout): ``` ../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_isoneutral_mixing amp first dynamic cpp stable: 43.948x 45.754x 44.906x fix: 47.505x 49.987x 47.493x backout: 48.243x 48.199x 48.192x ../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_equation_of_state amp first static default stable: 15.224x 13.286x 15.354x fix: 16.402x 16.370x 16.183x backout: 16.554x 16.675x 16.787x ../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench lennard_jones float32 first static default stable: 1.712x 1.651x 1.640x fix: 1.804x 1.798x 1.792x backout: 1.864x 1.824x 1.836x ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/126996 Approved by: https://github.com/jansel	2024-05-24 04:27:22 +00:00
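The commit above does not say how the 9.5% and 1.4% figures were derived. One plausible reading, sketched here as an assumption, is the average over the three benchmarks of the relative drop in mean speedup against the `backout` runs. The helper name `slowdown_pct` is made up for this note; the numbers are copied from the commit message.

```python
from statistics import mean

# Speedup figures copied from the commit message (three runs per variant).
runs = {
    "pyhpc_isoneutral_mixing": {
        "stable": [43.948, 45.754, 44.906],
        "fix": [47.505, 49.987, 47.493],
        "backout": [48.243, 48.199, 48.192],
    },
    "pyhpc_equation_of_state": {
        "stable": [15.224, 13.286, 15.354],
        "fix": [16.402, 16.370, 16.183],
        "backout": [16.554, 16.675, 16.787],
    },
    "lennard_jones": {
        "stable": [1.712, 1.651, 1.640],
        "fix": [1.804, 1.798, 1.792],
        "backout": [1.864, 1.824, 1.836],
    },
}


def slowdown_pct(variant):
    """Mean per-benchmark slowdown of `variant` relative to `backout`, in percent.

    Hypothetical helper: the averaging scheme is an assumption, not taken
    from the PR.
    """
    per_benchmark = [
        1.0 - mean(r[variant]) / mean(r["backout"]) for r in runs.values()
    ]
    return 100.0 * mean(per_benchmark)


stable_slowdown = slowdown_pct("stable")  # with #122074, before the fix
fix_slowdown = slowdown_pct("fix")        # with #122074 plus this fix
print(f"stable: {stable_slowdown:.2f}% slower, fix: {fix_slowdown:.2f}% slower")
```

Under this averaging scheme the results land close to the figures quoted in the commit, which suggests (but does not prove) that the author computed them in a similar way.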
PyTorch MergeBot	1b29c16e5e	Revert "Introduce ProcessGroupCudaP2P (#122163 )" This reverts commit `2dd2699860`. Reverted https://github.com/pytorch/pytorch/pull/122163 on behalf of https://github.com/jithunnair-amd due to This is breaking ROCm distributed CI on trunk ([comment](https://github.com/pytorch/pytorch/pull/122163#issuecomment-2127518473))	2024-05-23 16:06:14 +00:00
Matthew Hoffman	86ad101370	Enable pickling `torch._C.Generator` (#126271 ) Fixes #71398 Add `__reduce__` and `__setstate__` methods for `torch._C.Generator`. `__reduce__` returns a tuple of 3 values: 1. `torch.Generator` itself. 2. A one-element tuple containing the `torch.device` to create the `Generator` with, since this cannot be changed after the object is created. 3. The state, a three-element tuple: the initial seed, the offset (or `None` if a CPU `Generator`), and the RNG state tensor. `__setstate__` calls `manual_seed`, `set_offset` (if not `None`), and `set_state` on each respective element of the state. Added test demonstrating successful reserialization with cpu and cuda `Generator`s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126271 Approved by: https://github.com/ezyang	2024-05-22 14:38:47 +00:00
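The `__reduce__` / `__setstate__` contract described in the commit above can be illustrated without PyTorch. The sketch below uses a made-up `ToyGenerator` class (not `torch.Generator`) and replays by hand the steps that `pickle` performs on load: call the returned callable with the constructor arguments, then pass the state to `__setstate__`.

```python
class ToyGenerator:
    """Stand-in for a seeded RNG object; hypothetical, not torch.Generator."""

    def __init__(self, device="cpu"):
        # The device is fixed at construction, so it travels as a constructor
        # argument rather than as part of the mutable state.
        self.device = device
        self.seed = 0
        self.offset = None  # None mimics the CPU case described in the commit
        self.rng_state = b""

    def __reduce__(self):
        # Three values: the callable, its argument tuple, and the state.
        state = (self.seed, self.offset, self.rng_state)
        return (ToyGenerator, (self.device,), state)

    def __setstate__(self, state):
        seed, offset, rng_state = state
        self.seed = seed
        if offset is not None:
            self.offset = offset
        self.rng_state = rng_state


original = ToyGenerator("cpu")
original.seed = 1234
original.rng_state = b"\x01\x02"

# What pickle does on load, spelled out step by step.
factory, args, state = original.__reduce__()
restored = factory(*args)
restored.__setstate__(state)
```

Passing the device through the argument tuple instead of the state mirrors the reason given in the commit: it cannot be changed after the object is created.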
Yifu Wang	2dd2699860	Introduce ProcessGroupCudaP2P (#122163 ) ## Context This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers. The stack contains several components: - `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining. - `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops. - Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops. To enable the prototype feature: - Set the distributed backend to `cuda_p2p`. - Set `torch._inductor.config._micro_pipeline_tp` to `True`. NOTE: the prototype sets nothing in stone w.r.t to each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved. ## Benchmark Setup: - 8 x H100 (500W) + 3rd gen NVSwitch. - Llama3 8B training w/ torchtitan. - 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose. 
Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0 <img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1"> Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn <img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2"> ## This PR `ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA. `ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it. Usage: ``` # Using ProcessGroupCudaP2P dist.init_process_group(backend="cuda_p2p", ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options pg_options = ProcessGroupCudaP2P.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options pg_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) 
# Using ProcessGroupCudaP2P while specifying both # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options pg_options = ProcessGroupCudaP2P.Options() pg_options.nccl_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Down-casting the backend to access p2p buffers for cuda_p2p specific # optimizations if is_cuda_p2p_group(group): backend = get_cuda_p2p_backend(group) if required_p2p_buffer_size > backend.get_buffer_size(): # fallback p2p_buffer = backend.get_p2p_buffer(...) else: # fallback ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163 Approved by: https://github.com/wanchaol	2024-05-22 09:33:05 +00:00
Tristan Rice	ac51920656	Reapply "c10d: add Collectives abstraction (#125978 )" (#126695 ) This reverts commit `d9c3485146`. Reapplies #125978. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126695 Approved by: https://github.com/c-p-i-o	2024-05-21 18:00:09 +00:00
PyTorch MergeBot	d9c3485146	Revert "c10d: add Collectives abstraction (#125978 )" This reverts commit `4b2ae2ac33`. Reverted https://github.com/pytorch/pytorch/pull/125978 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/125978#issuecomment-2119858015))	2024-05-20 07:40:41 +00:00
Simon Fan	be67985bd7	[compiled autograd] log in cpp using python logger (#126483 ) Internal infra may not preserve python and c++ log ordering e.g. MAST logs: https://fburl.com/mlhub/38576cxn, all the `[python_compiled_autograd.cpp] Creating cache entry [...]` logs of the entire run are at the beginning of the file Pull Request resolved: https://github.com/pytorch/pytorch/pull/126483 Approved by: https://github.com/jansel ghstack dependencies: #126144, #126146, #126148	2024-05-19 23:49:52 +00:00
Mikayla Gawarecki	66dc8fb7ff	Allow tensor subclasses and add `torch.serialization.add_safe_globals` that allows users to allowlist classes for `weights_only` load (#124331 ) #### Conditions for allowlisting tensor subclasses We allow tensor subclasses types that (1) Do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__` so we check that these are `None`) (2) Use the generic `tp_alloc` (3) Are in a module that has been imported by the user to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2` Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function as this method claims to have no code execution. The rationale for the 3 conditions above is as follows: The rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__` and the call to `as_subclass` as well as the call to `_set_obj_state` which calls `setattr`) `4e66aaa010/torch/_tensor.py (L57-L71)` `as_subclass` is implemented with a call to `THPVariable_NewWithVar` that will eventually call `tp_alloc` here `4e66aaa010/torch/csrc/autograd/python_variable.cpp (L2053)` The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc` Note that we do not call `tp_init` or `tp_new` (i.e. 
`cls.__init__` or `cls.__new__`) when unpickling ### How do we check something is a tensor subclass/constraints around imports In order to check whether `bla` is a tensor subclass in the bytecode `GLOBAL module.name`, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We do not arbitrarily import modules but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys.modules[module], name), torch.Tensor)`). This PR also allowlisted `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`) ### API for allow listing This PR also added `torch.serialization.{add/get/clear}_safe_globals`, which enable users to allowlist globals they have deemed safe and to manipulate this list (for example, they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that this is safe). Next steps: - Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor` etc.) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124331 Approved by: https://github.com/albanD	2024-05-17 17:56:57 +00:00
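The allowlisting idea in the commit above can be sketched with the standard library alone. This is not PyTorch's `weights_only` unpickler: `allow_globals`, `AllowlistUnpickler`, and `safe_loads` are hypothetical names for this note, and the only mechanism shown is resolving `GLOBAL` references through a user-maintained table instead of importing arbitrary modules.

```python
import collections
import io
import pickle

# (module, qualified name) -> object the user has declared safe.
_ALLOWED = {}


def allow_globals(objs):
    """Register classes or functions that the restricted loader may resolve."""
    for obj in objs:
        _ALLOWED[(obj.__module__, obj.__qualname__)] = obj


class AllowlistUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the pickle stream references. Anything that
        # was not registered explicitly is rejected instead of imported.
        try:
            return _ALLOWED[(module, name)]
        except KeyError:
            raise pickle.UnpicklingError(
                f"{module}.{name} is not allowlisted"
            ) from None


def safe_loads(data):
    return AllowlistUnpickler(io.BytesIO(data)).load()


payload = pickle.dumps(collections.OrderedDict(a=1))

try:
    safe_loads(payload)
    rejected = False
except pickle.UnpicklingError:
    rejected = True  # OrderedDict was not registered yet

allow_globals([collections.OrderedDict])
loaded = safe_loads(payload)
```

Keeping the table explicit puts the safety decision with the user, which is the role the commit assigns to `torch.serialization.add_safe_globals`.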
Tristan Rice	4b2ae2ac33	c10d: add Collectives abstraction (#125978 ) This adds a new `Collectives` API for doing distributed collective operations. This is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debuggable primitives. Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit The standard implementation uses `StoreCollectives`, but other more performant backends will be added in a follow-up PR. Test plan: ``` python test/distributed/test_collectives.py -v ``` This tests both functionality using multiple threads as well as timeout behavior. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125978 Approved by: https://github.com/shuqiangzhang	2024-05-17 05:09:11 +00:00
Gustav Larsson	52fad83335	[onnx.export] Avoid linear look up in env for exist_in_env (#124909 ) This PR is part of a series of PRs to significantly speed up torch.onnx.export for models with many nodes (e.g. LLM). See #121422 for more analysis. - As part of torch.onnx.export, a reverse look-up is made in env. This is done for each node, and each look-up costs time proportional to the graph size, which incurs an overall O(N^2) time complexity. - A pragmatic solution is simply to keep a separate data structure to make this de facto constant time. So, this introduces a set containing all the values of env. Open to other ideas. Ideally `exist_in_env` wouldn't be needed at all, but to preserve current behavior exactly I'm not sure how that can be done. - Resolves (4) in #121422. - This code change and the choice of py::set look a bit more natural on top of #123063, where the env is changed from a std::unordered_map to a py::dict. Partially fixes #121422 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124909 Approved by: https://github.com/srikris-sridhar, https://github.com/justinchuby	2024-05-09 22:38:00 +00:00
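The complexity argument in the commit above can be shown with a small Python sketch. The `Env` class and its method names are hypothetical and are not the exporter's C++ code. One deviation is deliberate and should be read as an assumption: a `Counter` stands in for the plain set mentioned in the commit, so that overwriting a key does not drop a value still held under another key.

```python
from collections import Counter


class Env:
    """Key -> value mapping with a side index for reverse membership checks."""

    def __init__(self):
        self._mapping = {}
        self._value_counts = Counter()  # value -> number of keys holding it

    def __setitem__(self, key, value):
        if key in self._mapping:
            old = self._mapping[key]
            self._value_counts[old] -= 1
            if self._value_counts[old] == 0:
                del self._value_counts[old]
        self._mapping[key] = value
        self._value_counts[value] += 1

    def exist_in_env_linear(self, value):
        # Reverse look-up by scanning: O(N) per call, O(N^2) over N nodes.
        return any(v == value for v in self._mapping.values())

    def exist_in_env(self, value):
        # Hash-based membership: expected O(1) per call.
        return value in self._value_counts


env = Env()
env["a"] = 1
env["b"] = 1
env["a"] = 2  # value 1 is still reachable through "b"
both_agree = env.exist_in_env(1) == env.exist_in_env_linear(1)
```

The cost is one extra container kept in sync on every write, in exchange for constant-time membership checks on every read.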
PyTorch MergeBot	6fd745255e	Revert "add uuid in cudaDeviceProperties (#125083 )" This reverts commit `3f36145db2`. Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/izaitsevfb due to Fails internal builds with: no member named 'uuid' in 'hipDeviceProp_t' ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2103315320))	2024-05-09 19:52:45 +00:00
Jeff Daily	3f36145db2	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy	2024-05-08 19:15:55 +00:00
PyTorch MergeBot	2e237fcd70	Revert "[inductor] add cpp builder code. (#124045 )" This reverts commit `469383755f`. Reverted https://github.com/pytorch/pytorch/pull/124045 on behalf of https://github.com/clee2000 due to broke inductor/test_codecache and inductor/test_max_autotune `469383755f` https://github.com/pytorch/pytorch/actions/runs/8996772350/job/24724775182 ([comment](https://github.com/pytorch/pytorch/pull/124045#issuecomment-2100851419))	2024-05-08 15:33:20 +00:00
Xu Han	469383755f	[inductor] add cpp builder code. (#124045 ) The previous full PR https://github.com/pytorch/pytorch/pull/115248 failed to merge because fb_code is hard to debug. I also tried to submit it as two pieces, https://github.com/pytorch/pytorch/pull/118514 https://github.com/pytorch/pytorch/pull/118515, which passed PreCI at that time. This PR splits https://github.com/pytorch/pytorch/pull/115248 into smaller pieces and is the first step of RFC https://github.com/pytorch/pytorch/issues/124245. Changes: 1. Add cpp builder code; the new cpp_builder supports Windows. 2. Add a CPU ISA checker that is cross-OS and exported from backend cpuinfo. 3. Switch the compiler ISA checker to the new cpp builder. 4. CppCodeCache uses the new ISA checker. 5. Add a temporary `test_new_cpp_build_logical` UT to help the transition to the new code. <img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124045 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-05-08 05:27:15 +00:00
PyTorch MergeBot	2f79a18324	Revert "[inductor] add cpp builder code. (#124045 )" This reverts commit `7864d287a1`. Reverted https://github.com/pytorch/pytorch/pull/124045 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing trunk jobs `7864d287a1` including lint ([comment](https://github.com/pytorch/pytorch/pull/124045#issuecomment-2099306071))	2024-05-07 21:04:49 +00:00
Xu Han	7864d287a1	[inductor] add cpp builder code. (#124045 ) The previous full PR https://github.com/pytorch/pytorch/pull/115248 failed to merge because fb_code is hard to debug. I also tried to submit it as two pieces, https://github.com/pytorch/pytorch/pull/118514 https://github.com/pytorch/pytorch/pull/118515, which passed PreCI at that time. This PR splits https://github.com/pytorch/pytorch/pull/115248 into smaller pieces and is the first step of RFC https://github.com/pytorch/pytorch/issues/124245. Changes: 1. Add cpp builder code; the new cpp_builder supports Windows. 2. Add a CPU ISA checker that is cross-OS and exported from backend cpuinfo. 3. Switch the compiler ISA checker to the new cpp builder. 4. CppCodeCache uses the new ISA checker. 5. Add a temporary `test_new_cpp_build_logical` UT to help the transition to the new code. <img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124045 Approved by: https://github.com/jgong5, https://github.com/jansel	2024-05-07 20:07:41 +00:00
Aaron Orenstein	b23b6e7108	Ensure that vmap is restored properly if an exception is thrown during frame eval (#122074 ) We save and restore the DynamicLayerStack during frame eval but since fx graph has no way to express a try/finally we just assume it will happen. If we throw an exception between the push and pop to the stack then we're left in a state that affects following operations poorly. Make sure that if it's in a bad state we restore it after frame eval. Repro: before: ``` $ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8 $ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8' ============= 1 passed, 8588 deselected in 9.75s ============= $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8' ================== short test summary info =================== FAILED [0.0632s] test/test_sparse.py::TestSparseCPU::test_log1p_cpu_uint8 - AssertionError: "only Tensors of floating point dtype can require gradients" does not match "You are attempting to call Tensor.requires_grad_() (or perhaps using torch.autograd.functional.* APIs) inside of a function ... 
======= 1 failed, 1 skipped, 8587 deselected in 10.99s ======= ``` (Note that adding test_vmap_free_tensor_dynamic_shapes causes test_log1p_cpu_uint8 to fail) after: ``` $ rm test/dynamo_skips/TestSparseCPU.test_log1p_cpu_uint8 $ rm test/dynamo_expected_failures/FuncTorchHigherOrderOpTests.test_vmap_free_tensor $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_log1p_cpu_uint8' ============= 1 passed, 8588 deselected in 9.89s ============= $ PYTORCH_TEST_WITH_DYNAMO=1 pytest test/jit/test_sparse.py test/dynamo/test_dynamic_shapes.py test/inductor/test_torchinductor_dynamic_shapes.py test/test_sparse.py -k 'test_vmap_free_tensor_dynamic_shapes or test_log1p_cpu_uint8' ======= 1 passed, 1 skipped, 8587 deselected in 11.34s ======= ``` (test_vmap_free_tensor_dynamic_shapes passes either way) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122074 Approved by: https://github.com/oulgen	2024-05-07 19:36:52 +00:00
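The failure mode described in the commit above (a push without a matching pop when an exception escapes) can be reproduced in plain Python. The sketch is only an analogy: `layer_stack` and `eval_frame_guarded` are hypothetical stand-ins for the DynamicLayerStack and the frame evaluation hook, and the restore policy shown (snapshot, then roll back on any exception) is an assumption about the general approach, not the PR's code.

```python
layer_stack = []  # stand-in for the DynamicLayerStack


def eval_frame_guarded(fn):
    """Run `fn`, rolling the stack back if it raises part-way through."""
    saved = list(layer_stack)
    try:
        return fn()
    except BaseException:
        # The body may have pushed without popping; restore the snapshot so
        # later, unrelated operations do not see a stale layer.
        layer_stack[:] = saved
        raise


def frame_that_fails():
    layer_stack.append("vmap")  # push
    raise RuntimeError("error between push and pop")
    # the matching pop is never reached


try:
    eval_frame_guarded(frame_that_fails)
except RuntimeError:
    pass

stack_after_failure = list(layer_stack)
```

Without the rollback the stale `"vmap"` entry would leak into whatever runs next, which matches the cross-test interference shown in the repro.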
PyTorch MergeBot	5fd0b6e5f7	Revert "add uuid in cudaDeviceProperties (#125083 )" This reverts commit `f35fe4eaf1`. Reverted https://github.com/pytorch/pytorch/pull/125083 on behalf of https://github.com/clee2000 due to test_uuid is flaky. ex https://github.com/pytorch/pytorch/actions/runs/8988855916/job/24692369523 https://hud.pytorch.org/flakytest?name=test_uuid&suite=TestCuda&file=%25&limit=300 ([comment](https://github.com/pytorch/pytorch/pull/125083#issuecomment-2099029993))	2024-05-07 18:16:27 +00:00
Jeff Daily	f35fe4eaf1	add uuid in cudaDeviceProperties (#125083 ) Replaces #99967. Fixes #99903. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083 Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy	2024-05-07 01:26:01 +00:00
Randolf Scholz	ccaf03fd89	Fix: `nn.Parameter` return type identified as `Tensor` instead of `nn.Parameter` (#125106 ) Fixes #125105 Pull Request resolved: https://github.com/pytorch/pytorch/pull/125106 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-04-29 23:25:23 +00:00
Simon Fan	43a7ab2a21	[compiled autograd] introduce verbose logs, add autograd node info to graph (#124954 ) - sets it as a fake stack trace as we don't have a generic comment feature - when verbose is disabled, still adds a contextmanager and flag checks. the alternative is to use MACROS, but that wouldn't be usable with TORCH_LOGS Pull Request resolved: https://github.com/pytorch/pytorch/pull/124954 Approved by: https://github.com/jansel	2024-04-27 01:10:37 +00:00
egienvalue	73744a2c00	torch.mtia module for MTIA device backend (#123612 ) The MTIA device now has its own module in PyTorch. torch.mtia has the following APIs, similar to other backends. lazy_init is also supported. ``` __all__ = [ "init", "is_available", "synchronize", "device_count", "current_device", "current_stream", "default_stream", "set_stream", "stream", "device", ] ``` ------------ For device management, we expand AcceleratorHooksInterface to support generic device management; it can be used from both C++ and Python. ``` def _accelerator_hooks_device_count() -> _int: ... def _accelerator_hooks_set_current_device(device_index: _int) -> None: ... def _accelerator_hooks_get_current_device() -> _int : ... def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ... def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ... ``` --------- Adding get_device_module API to retrieve device modules for different device types. ``` def get_device_module(device: Optional[Union[torch.device, str]] = None) ``` --------- Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612 Approved by: https://github.com/albanD ghstack dependencies: #123611	2024-04-26 16:17:54 +00:00
Aaron Orenstein	609c958281	Fix mypy issues in fake_tensor.py (#124428 ) fake_tensor.py had mypy error ignored. That seems less than desirable. Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees). Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428 Approved by: https://github.com/malfet	2024-04-26 15:35:53 +00:00
PyTorch MergeBot	f131c2c199	Revert "Fix mypy issues in fake_tensor.py (#124428 )" This reverts commit `25c0d3f3f0`. Reverted https://github.com/pytorch/pytorch/pull/124428 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))	2024-04-26 06:15:17 +00:00
PyTorch MergeBot	e04c7b19f4	Revert "torch.mtia module for MTIA device backend (#123612 )" This reverts commit `381653de63`. Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to this PR broke ROCm with message RuntimeError: Cannot have MTIA with other devices ([comment](https://github.com/pytorch/pytorch/pull/123612#issuecomment-2077649762))	2024-04-25 16:06:46 +00:00
Yu, Guangye	cdc66e9dc3	refactor autocast python APIs (#124479 ) # Motivation Refactor autocast usage scenario in `torch/amp/autocast_mode.py` and `torch/utils/checkpoint.py` to fix the bug - convention conflict between `torch.xxx.get_autocast_xxx_dtype` defined in `autocast_mode.py` and `torch.xxx.get_autocast_dtype` defined in `checkpoint.py`. # Solution Use device-agnostic APIs like `torch.get_autocast_dtype`, ..., instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124479 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #124359	2024-04-25 14:33:33 +00:00
Aaron Orenstein	25c0d3f3f0	Fix mypy issues in fake_tensor.py (#124428 ) fake_tensor.py had mypy error ignored. That seems less than desirable. Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees). Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428 Approved by: https://github.com/malfet	2024-04-25 14:07:53 +00:00
Animesh Jain	e68d65dae2	[dynamo][cpp-guards] Differentiate dict guards wrt to guarding on key order (#124779 ) We guard on key order 1) when a key is a non-constant object, and 2) when we actually need key order, e.g. for .values, .items etc. For dicts/OrderedDicts that do not require key order guarding, we just rely on the usual `GuardManager + DictGetItemGuardAccessor`. This is faster than going through the `list(d.keys())` based design for OrderedDicts. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124779 Approved by: https://github.com/jansel	2024-04-25 08:20:35 +00:00
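The distinction drawn in #124779 can be illustrated outside Dynamo with a plain-Python sketch. The two helper functions below are hypothetical and are not the Dynamo guard implementation; they only contrast a per-key check, which ignores insertion order, with a `list(d.keys())` comparison, which is sensitive to it.

```python
from collections import OrderedDict


def make_getitem_guard(expected):
    """Hypothetical guard: look up each expected key; key order is ignored."""
    items = list(expected.items())

    def check(d):
        return len(d) == len(items) and all(
            k in d and d[k] == v for k, v in items
        )

    return check


def make_key_order_guard(expected):
    """Hypothetical guard: additionally require the same key insertion order."""
    keys = list(expected.keys())
    getitem_check = make_getitem_guard(expected)

    def check(d):
        return list(d.keys()) == keys and getitem_check(d)

    return check


traced = OrderedDict(a=1, b=2)
reordered = OrderedDict(b=2, a=1)

getitem_guard = make_getitem_guard(traced)
order_guard = make_key_order_guard(traced)
```

In this sketch the per-key guard accepts `reordered` while the key-order guard rejects it, which is why the stricter check is only worth paying for when the traced code actually depends on order.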
egienvalue	381653de63	torch.mtia module for MTIA device backend (#123612 ) MTIA device has its own Module in PyTorch now. torch.mtia has the following APIs, similar to other backends. The lazy_init is also supported. ``` __all__ = [ "init", "is_available", "synchronize", "device_count", "current_device", "current_stream", "default_stream", "set_stream", "stream", "device", ] ``` ------------ For device management. We expand AcceleratorHooksInterface to support generic device management and it can be used in both C++ and Python. ``` def _accelerator_hooks_device_count() -> _int: ... def _accelerator_hooks_set_current_device(device_index: _int) -> None: ... def _accelerator_hooks_get_current_device() -> _int : ... def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ... def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ... ``` --------- Adding get_device_module API to retrieve device modules for different device types. ``` def get_device_module(device: Optional[Union[torch.device, str]] = None) ``` --------- Differential Revision: [D56443356](https://our.internmc.facebook.com/intern/diff/D56443356) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612 Approved by: https://github.com/albanD ghstack dependencies: #123611	2024-04-24 20:51:20 +00:00
egienvalue	408aa0182c	Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611 ) This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch. ------------ torch.Stream APIs ``` # Defined in torch/csrc/Stream.cpp class Stream(_StreamBase): stream_id: _int # Stream id device_index: _int device_type: _int device: _device # The device of the stream @overload def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ... @overload def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ... def wait_event(self, event: Event) -> None: ... def wait_stream(self, other: Stream) -> None: ... def record_event(self, event: Optional[Event] = None) -> Event: ... def query(self) -> None: ... def synchronize(self) -> None: ... def __hash__(self) -> _int: ... def __repr__(self) -> str: ... def __eq__(self, other: object) -> _bool: ... ``` ------------------ torch.Event APIs: - IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream. - currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag. - elapsedTime API is added to c10::Event ``` # Defined in torch/csrc/Event.cpp class Event(_EventBase): device: _device # The device of the Event event_id: _int # The raw event created by device backend def __new__(self, device: Optional[DeviceLikeType] = None, enable_timing: _bool = False, blocking: _bool = False, interprocess: _bool = False) -> Event: ... @classmethod def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ... def record(self, stream: Optional[Stream] = None) -> None: ... def wait(self, stream: Optional[Stream] = None) -> None: ... def query(self) -> _bool: ... 
def elapsed_time(self, other: Event) -> _float: ... def synchronize(self) -> None: ... def ipc_handle(self) -> bytes: ... def __repr__(self) -> str: ... ``` ----------- c10::Event provides new APIs - calculate elapsedTime. - Get raw event id - Synchronize event. ``` double elapsedTime(const Event& event) const { return impl_.elapsedTime(event.impl_); } void* eventId() const { return impl_.eventId(); } void synchronize() const { return impl_.synchronize(); } ``` ---------- TODO: need to find a good way to test them in PyTorch with API mocks. Differential Revision: [D56443357](https://our.internmc.facebook.com/intern/diff/D56443357) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611 Approved by: https://github.com/albanD, https://github.com/jeffdaily	2024-04-24 20:51:17 +00:00
Yu, Guangye	25f321b84f	Refactor autocast C++ APIs to be device-agnostic (#124359 ) # Motivation This PR aims to refactor autocast C++ APIs to be device-agnostic and deprecate the device-specific autocast C++ APIs. In C++ side, - `is_enabled()` -> `is_enabled(device_type)`. - `set_enabled(new_enabled)` -> `set_enabled(device_type, new_enabled)`. - `get_autocast_dtype()` -> `get_autocast_dtype(device_type)` - `set_autocast_dtype(dtype)` -> `set_autocast_dtype(device_type, dtype)` These following C++ APIs are deprecated and should be removed in PyTorch 2.5 - `is_cpu_enabled` - `set_cpu_enabled` - `get_autocast_cpu_dtype` - `set_autocast_cpu_dtype` - `is_xpu_enabled` - `set_xpu_enabled` - `get_autocast_xpu_dtype` - `set_autocast_xpu_dtype` - `is_ipu_enabled` - `set_ipu_enabled` - `get_autocast_ipu_dtype` - `set_autocast_ipu_dtype` - `is_hpu_enabled` - `set_hpu_enabled` - `get_autocast_hpu_dtype` - `set_autocast_hpu_dtype` - `is_xla_enabled` - `set_xla_enabled` - `get_autocast_xla_dtype` - `set_autocast_xla_dtype` - `is_privateuseone_enabled` - `set_privateuseone_enabled` - `get_autocast_privateuseone_dtype` - `set_autocast_privateuseone_dtype` In Python side, provide 4 generic autocast APIs: - `torch.is_autocast_enabled(device_type)` - `torch.set_autocast_enabled(device_type, new_enabled)` - `torch.get_autocast_dtype(device_type)` - `torch.set_autocast_dtype(device_type, dtype)` # Additional Context We will submit another PR to refactor autocast Python APIs based on this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124359 Approved by: https://github.com/jgong5, https://github.com/albanD	2024-04-23 10:38:50 +00:00
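The shape of this refactor can be shown with a toy model in plain Python. Nothing below is PyTorch code: `AutocastState` and its string dtypes are hypothetical stand-ins, meant only to show how one table keyed by device type can replace a separate function family per device.

```python
from dataclasses import dataclass, field


@dataclass
class AutocastState:
    """Hypothetical stand-in for per-device autocast state (not PyTorch code)."""

    enabled: dict = field(default_factory=dict)
    dtype: dict = field(default_factory=dict)

    def is_enabled(self, device_type):
        # Devices that were never configured default to disabled.
        return self.enabled.get(device_type, False)

    def set_enabled(self, device_type, new_enabled):
        self.enabled[device_type] = bool(new_enabled)

    def get_autocast_dtype(self, device_type):
        return self.dtype[device_type]

    def set_autocast_dtype(self, device_type, dtype):
        self.dtype[device_type] = dtype


state = AutocastState()
state.set_autocast_dtype("cpu", "bfloat16")
state.set_enabled("cpu", True)


# A device-specific entry point becomes a thin alias over the generic API.
def is_cpu_enabled():
    return state.is_enabled("cpu")
```

Under this assumed design, adding a new device type needs no new functions, only a new key.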
Ashwin Hari	5f5778476a	rename ort to maia (#123265 ) Fixes #123264 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123265 Approved by: https://github.com/albanD	2024-04-23 00:33:25 +00:00
Jeff Daily	6ede882c0b	preferred blas library; cublaslt gemm implementation (#122106 ) Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources. The default blas implementation remains cublas or hipblas. cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106 Approved by: https://github.com/lezcano	2024-04-22 15:38:22 +00:00
Chen, Zejun	b1984237a0	[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247 ) This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing. #suppress-api-compatibility-check Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247 Approved by: https://github.com/aaronenyeshi	2024-04-22 01:26:55 +00:00
PyTorch MergeBot	0feab7d6c3	Revert "Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611 )" This reverts commit `cb17721899`. Reverted https://github.com/pytorch/pytorch/pull/123611 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))	2024-04-19 22:44:26 +00:00
PyTorch MergeBot	929242a15c	Revert "torch.mtia module for MTIA device backend (#123612 )" This reverts commit `d7e1bf9ff9`. Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))	2024-04-19 22:44:26 +00:00
ydwu4	e62169a8fa	Support torchbind op dispatch in python (#123367 ) We override the `__call__` method and register fake, functional, and proxy default dispatch mode implementations in its python_key_mode_table. The idea is: 1. When inputs contain a FakeScriptObject, we dispatch through the _get_dispatch mechanism. We implement dispatch mode keys automatically in the operator's constructor. 2. When inputs are not fakified, we dispatch through the original C++ dispatcher. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123367 Approved by: https://github.com/zou3519	2024-04-19 17:17:27 +00:00
PyTorch MergeBot	520bc1080e	Revert "[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247 )" This reverts commit `768ce2cdda`. Reverted https://github.com/pytorch/pytorch/pull/123247 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123247#issuecomment-2066152611))	2024-04-19 09:09:03 +00:00
Chen, Zejun	768ce2cdda	[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247 ) This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing. #suppress-api-compatibility-check Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247 Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui	2024-04-19 03:31:13 +00:00
egienvalue	d7e1bf9ff9	torch.mtia module for MTIA device backend (#123612 ) MTIA device has its own Module in PyTorch now. torch.mtia has the following APIs, similar to other backends. The lazy_init is also supported. ``` __all__ = [ "init", "is_available", "synchronize", "device_count", "current_device", "current_stream", "default_stream", "set_stream", "stream", "device", ] ``` ------------ For device management. We expand AcceleratorHooksInterface to support generic device management and it can be used in both C++ and Python. ``` def _accelerator_hooks_device_count() -> _int: ... def _accelerator_hooks_set_current_device(device_index: _int) -> None: ... def _accelerator_hooks_get_current_device() -> _int : ... def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ... def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ... ``` --------- Adding get_device_module API to retrieve device modules for different device types. ``` def get_device_module(device: Optional[Union[torch.device, str]] = None) ``` --------- @exported-using-ghexport Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612 Approved by: https://github.com/albanD ghstack dependencies: #123611	2024-04-18 17:38:06 +00:00
egienvalue	cb17721899	Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611 ) This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch. ------------ torch.Stream APIs ``` # Defined in torch/csrc/Stream.cpp class Stream(_StreamBase): stream_id: _int # Stream id device_index: _int device_type: _int device: _device # The device of the stream @overload def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ... @overload def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ... def query(self) -> _bool: ... def synchronize(self) -> None: ... def wait_event(self, event: Event) -> None: ... def wait_stream(self, other: Stream) -> None: ... def record_event(self, event: Optional[Event] = None) -> Event: ... def query(self) -> None: ... def synchronize(self) -> None: ... def __hash__(self) -> _int: ... def __repr__(self) -> str: ... def __eq__(self, other: object) -> _bool: ... ``` ------------------ torch.Event APIs: - IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream. - currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag. - elapsedTime API is added to c10::Event ``` # Defined in torch/csrc/Event.cpp class Event(_EventBase): device: _device # The device of the Event event_id: _int # The raw event created by device backend def __new__(self, device: Optional[DeviceLikeType] = None, enable_timing: _bool = False, blocking: _bool = False, interprocess: _bool = False) -> Event: ... @classmethod def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ... def record(self, stream: Optional[Stream] = None) -> None: ... 
def wait(self, stream: Optional[Stream] = None) -> None: ... def query(self) -> _bool: ... def elapsed_time(self, other: Event) -> _float: ... def synchronize(self) -> None: ... def ipc_handle(self) -> bytes: ... def __repr__(self) -> str: ... ``` ----------- c10::Event provides new APIs - calculate elapsedTime. - Get raw event id - Synchronize event. ``` double elapsedTime(const Event& event) const { return impl_.elapsedTime(event.impl_); } void* eventId() const { return impl_.eventId(); } void synchronize() const { return impl_.synchronize(); } ``` ---------- TODO: need to find a good way to test them in PyTorch with API mocks. Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611 Approved by: https://github.com/albanD	2024-04-18 17:35:09 +00:00
rzou	648c39c47d	Add OpOverload.redispatch; use it in new custom ops API (#124089 ) A kernel has "dispatcher convention" if there is an additional keyset arg at the beginning of the argument list. This PR: - adds a way to register kernels with dispatcher_convention using Library.impl (pass dispatcher_convention = True) - adds OpOverload.redispatch We use both of the above in the new custom ops API: we register the autograd kernel in dispatcher convention so that we can actually call redispatch like how pytorch built-in ops do it. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089 Approved by: https://github.com/albanD ghstack dependencies: #123937, #124064, #124065, #124066, #124071	2024-04-18 12:48:04 +00:00
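The "dispatcher convention" idea can be sketched with a toy dispatcher in plain Python. Everything here is hypothetical (the `KERNELS` table, the string keys, and this `redispatch` helper are not the `OpOverload.redispatch` or `Library.impl` APIs); the sketch only shows a kernel receiving the remaining keyset as its first argument so that it can hand control to the next kernel.

```python
calls = []


def redispatch(kernels, keyset, *args):
    """Toy redispatch: run the kernel for the first key, pass it the rest."""
    key, remaining = keyset[0], keyset[1:]
    return kernels[key](remaining, *args)


def autograd_kernel(keyset, x):
    # Dispatcher convention: the keyset arrives ahead of the real arguments.
    calls.append("autograd")
    return redispatch(KERNELS, keyset, x)


def cpu_kernel(keyset, x):
    calls.append("cpu")
    return x * 2


KERNELS = {"Autograd": autograd_kernel, "CPU": cpu_kernel}

result = redispatch(KERNELS, ("Autograd", "CPU"), 21)
```

The design point the sketch tries to capture is that the autograd-level kernel never needs to know which backend kernel runs next; it only forwards the keyset it was given.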
PHLens	9aba918bd8	Support Accelerator OOM Error (#121200 ) (#121702 ) Fixes #121200 This PR introduces AcceleratorOutOfMemoryError for all privateuse1 backends. For Python, there is a PyError object which is set only when privateuse1 is registered. All privateuse1 backends can then use this error for memory errors. More error types may be added in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121702 Approved by: https://github.com/guangyey, https://github.com/albanD	2024-04-15 21:41:46 +00:00
NiTIAN	3dde6a461f	fix cpp path in torch/_C/_autograd.pyi (#123924 ) The file `tools/autograd/init.cpp` does not exist, I think the right path is `torch/csrc/autograd/init.cpp`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123924 Approved by: https://github.com/Skylion007	2024-04-12 22:32:00 +00:00
Shengbao Zheng	4e9094533e	[c10d/nccl-pg] allow user to pass process group description (#123472 ) Summary: We need a way to let users set a customized description for a process group, e.g. FSDP, PP. Here are several use cases for a user-specified group_desc: - Logging: we can easily match a log line and understand what the collective/PG is used for. - PyTorch traces (e.g. Kineto, Execution Trace) can benefit from the PG desc, since trace analysis and benchmarks will be able to easily differentiate PG purposes like FSDP, PP. - Lower-layer collectives (e.g. NCCL) debugging: we will be able to expose the PG desc to the NCCL communicator so NCCL layer operations can be easily correlated to a PG. Solution: Add a group_desc field to c10d Differential Revision: D55781850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472 Approved by: https://github.com/kwen2501	2024-04-12 08:44:21 +00:00
Shivam Raikundalia	3ebbeb75fd	[Profiler] Make Kineto traces export ns granularity for finer timestamps (#122425 ) (#123650 ) Summary: Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself. This diff contains profiler changes only. Libkineto changes found in D54964435. Test Plan: Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master. Zoomer: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=796886748550189 Ran key_averages() to make sure FunctionEvent code working as expected: -- ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ProfilerStep* 0.74% 3.976ms 64.40% 346.613ms 69.323ms 0.000us 0.00% 61.710ms 12.342ms 5 Optimizer.zero_grad#SGD.zero_grad 0.76% 4.109ms 0.76% 4.109ms 821.743us 0.000us 0.00% 0.000us 0.000us 5 ## forward ## 6.89% 37.057ms 27.19% 146.320ms 29.264ms 0.000us 0.00% 58.708ms 11.742ms 5 aten::conv2d 0.22% 1.176ms 7.74% 41.658ms 157.199us 0.000us 0.00% 27.550ms 103.962us 265 aten::convolution 0.79% 4.273ms 7.52% 40.482ms 152.762us 0.000us 0.00% 27.550ms 103.962us 265 aten::_convolution 0.69% 3.688ms 6.73% 36.209ms 136.637us 0.000us 0.00% 27.550ms 103.962us 265 aten::cudnn_convolution 6.04% 32.520ms 6.04% 32.520ms 122.719us 27.550ms 8.44% 27.550ms 103.962us 265 aten::add_ 2.42% 13.045ms 2.42% 13.045ms 30.694us 12.700ms 3.89% 12.700ms 29.882us 425 aten::batch_norm 0.19% 1.027ms 8.12% 43.717ms 164.971us 0.000us 0.00% 16.744ms 63.185us 265 aten::_batch_norm_impl_index 0.31% 1.646ms 7.93% 42.691ms 161.096us 0.000us 0.00% 16.744ms 63.185us 265 ------------------------------------------------------- ------------ ------------ ------------ ------------ 
------------ ------------ ------------ ------------ ------------ ------------ Differential Revision: D55925068 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123650 Approved by: https://github.com/aaronenyeshi	2024-04-11 04:29:20 +00:00
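To see why the export granularity matters, here is a small arithmetic sketch in plain Python. The timestamps are made-up values, not taken from a real trace; the point is only that two events less than a microsecond apart become indistinguishable once nanosecond timestamps are truncated to microseconds.

```python
NS_PER_US = 1_000

# Made-up event timestamps in nanoseconds, 400 ns apart.
start_a_ns = 1_712_000_000_000_123_100
start_b_ns = 1_712_000_000_000_123_500

# Truncating to microsecond granularity drops the sub-microsecond part.
start_a_us = start_a_ns // NS_PER_US
start_b_us = start_b_ns // NS_PER_US

gap_ns = start_b_ns - start_a_ns  # ordering and distance are preserved
gap_us = start_b_us - start_a_us  # both events collapse onto one tick
```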
Aaron Orenstein	2bcc83dfbd	Preserve dispatch state across function tracing (#122073 ) If we throw an exception in the "wrong" place we can end up with the dispatch state being in a weird state which can cause all future dispatching to fail. Preserve and restore it as part of `preserve_global_state` so we know it's sane after that. Also fake_tensor's in_kernel_invocation_manager() was leaving a bit set in the dispatcher (DispatchKey.Dense) which affected follow-on code. Fixed that to reset after as well. Repro: before: ``` $ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64 $ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64' ======== 1 passed, 6173 deselected in 5.21s ============= $ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64' ========= 1 skipped, 6172 deselected, 1 error in 5.29s ========= ``` (note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes on its own but failed when including the skipped test_export.py tests) after: ``` $ rm test/dynamo_skips/TestSparseCPU.test_to_dense_with_gradcheck_sparse_cpu_complex64 $ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_to_dense_with_gradcheck_sparse_cpu_complex64' ===================== 1 passed, 6173 deselected in 5.42s ===================== $ PYTORCH_TEST_WITH_DYNAMO=1 pytest -s test/dynamo/test_export.py test/test_sparse.py -k 'test_torch_inference_mode_ctx or test_to_dense_with_gradcheck_sparse_cpu_complex64' ===================== 1 passed, 1 skipped, 6172 deselected in 7.30s ====================== ``` (note that test_to_dense_with_gradcheck_sparse_cpu_complex64 passes in both runs) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122073 Approved by: https://github.com/zou3519	2024-04-10 18:57:01 +00:00
Shivam Raikundalia	c9c099b271	Add kwargs to RecordFunctionFast (#123600 ) Differential Revision: [D55897888](https://our.internmc.facebook.com/intern/diff/D55897888/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123600 Approved by: https://github.com/davidberard98	2024-04-10 18:17:50 +00:00
FFFrog	5c1bde99c0	Fix the uncorrect return value of Tensor.numpy() (#123538 ) Fixes #123494 As the ISSUE stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123538 Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007	2024-04-10 14:47:24 +00:00
Jason Ansel	e3ea316623	[dynamo] Save/restore cublas_allow_tf32 in convert_frame (#123509 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123509 Approved by: https://github.com/anijain2305	2024-04-07 03:37:47 +00:00
PyTorch MergeBot	c66d503194	Revert "[Profiler][submodule] Make Kineto traces export ns granularity for finer timestamps (#122425 )" This reverts commit `6f7dd2f84a`. Reverted https://github.com/pytorch/pytorch/pull/122425 on behalf of https://github.com/malfet due to Breaks ROCM builds ([comment](https://github.com/pytorch/pytorch/pull/122425#issuecomment-2041129241))	2024-04-06 16:19:00 +00:00
Shivam Raikundalia	6f7dd2f84a	[Profiler][submodule] Make Kineto traces export ns granularity for finer timestamps (#122425 ) Summary: Kineto traces use microsecond level granularity because of chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself. This diff contains profiler changes only. Libkineto changes found in D54964435. Test Plan: Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master. Tracing with flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_37_22.4155151.pt.trace.json.gz&bucket=gpu_traces Tracing without flags enabled: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_39_15.4166047.pt.trace.json.gz&bucket=gpu_traces Tracing on main: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Mar_18_14_42_43.4177559.pt.trace.json.gz&bucket=gpu_traces Ran key_averages() to make sure FunctionEvent code working as expected: -- ------------ ------------ Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls ProfilerStep* 0.74% 3.976ms 64.40% 346.613ms 69.323ms 0.000us 0.00% 61.710ms 12.342ms 5 Optimizer.zero_grad#SGD.zero_grad 0.76% 4.109ms 0.76% 4.109ms 821.743us 0.000us 0.00% 0.000us 0.000us 5 ## forward ## 6.89% 37.057ms 27.19% 146.320ms 29.264ms 0.000us 0.00% 58.708ms 11.742ms 5 aten::conv2d 0.22% 1.176ms 7.74% 41.658ms 157.199us 0.000us 0.00% 27.550ms 103.962us 265 aten::convolution 0.79% 4.273ms 7.52% 40.482ms 152.762us 0.000us 0.00% 27.550ms 103.962us 265 aten::_convolution 0.69% 3.688ms 6.73% 36.209ms 136.637us 0.000us 
0.00% 27.550ms 103.962us 265 aten::cudnn_convolution 6.04% 32.520ms 6.04% 32.520ms 122.719us 27.550ms 8.44% 27.550ms 103.962us 265 aten::add_ 2.42% 13.045ms 2.42% 13.045ms 30.694us 12.700ms 3.89% 12.700ms 29.882us 425 aten::batch_norm 0.19% 1.027ms 8.12% 43.717ms 164.971us 0.000us 0.00% 16.744ms 63.185us 265 aten::_batch_norm_impl_index 0.31% 1.646ms 7.93% 42.691ms 161.096us 0.000us 0.00% 16.744ms 63.185us 265 ------------------------------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ Differential Revision: D55087993 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122425 Approved by: https://github.com/aaronenyeshi	2024-04-06 06:04:28 +00:00
sraikund16	6fa72480d3	Enhance RecordFunctionFast input args and use input args in triton_heuristics.py (#123459 ) Summary: Now that we can input shapes as input args for RecordFunctionFast, let's add that to the triton heuristics. Also, lets add the ability to pass in a tuple into the RecordFunctionFast constructor. Test Plan: Ran both the _inductor/test_profile.py and profiler/test_profiler.py unit tests. Also added tuple based unit test to profiler/test_profiler.py Ran record_function_fast.py from the following branch https://github.com/pytorch/pytorch/compare/sraikund/record_funct_test?expand=1 No shape or args: tests function fast with no args and profile without record_shapes With shape tests: tests function fast with args and profile with record_shapes true Args no shape: tests function fast with args inputted but record_shapes set to false Args shape tuple: tests function fast with args inputted in form of tuple and record_shapes true Stdout: No shape or args:: 1.8491458892822266 us With shape:: 2.211381196975708 us Args no shape:: 1.9212646484375 us With shape tuple:: 2.245788335800171 us Differential Revision: D55809967 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123459 Approved by: https://github.com/davidberard98	2024-04-06 02:44:06 +00:00
rzou	81e7a7c955	Add mutated_args field to custom_op (#123129 ) If provided, we: - autogenerate an ADInplaceOrView implementation - assume that no mutated inputs are returned as outputs. There are already aliasing runtime checks that check this. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123129 Approved by: https://github.com/albanD ghstack dependencies: #123108, #123109, #123110	2024-04-05 22:03:51 +00:00
rzou	067851dd0d	Expand is_functional_schema to work with torch._C._FunctionSchema (#123108 ) Previously it worked with torchgen.model.FunctionSchema. This PR extends it to work with torch._C._FunctionSchema by making torchgen.model.FunctionSchema look more like torch._C._FunctionSchema. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123108 Approved by: https://github.com/albanD	2024-04-05 22:03:39 +00:00
rzou	d486cb7c1b	Deprecate calling FakeTensor.data_ptr in eager-mode (#123292 ) Today, we error out on FakeTensor.data_ptr under torch.compile. This PR moves to error out on FakeTensor.data_ptr under eager mode to avoid diverging behavior. We do this by adding another bit onto FakeTensor that we'll remove after the deprecation cycle. Test Plan: - tested locally Pull Request resolved: https://github.com/pytorch/pytorch/pull/123292 Approved by: https://github.com/eellison ghstack dependencies: #123261, #123282, #123291	2024-04-04 20:35:24 +00:00
Shivam Raikundalia	4732375042	make RecordFunctionFast take inputs (#123208 ) Summary: RECORD_FUNCTION in C++ and torch.profiler.record_function already support recording inputs. Let's do the same for RecordFunctionFast. Test Plan: Add tests in test_profiler.py that take args and also do not take args so we can support it being an optional parameter Differential Revision: D55648870 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123208 Approved by: https://github.com/davidberard98	2024-04-03 21:58:09 +00:00
Yu, Guangye	eb7adc3ae0	Refactor gpu trace to be device-agnostic (#121794 ) # Motivation Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic and shareable among all device backends. # Solution Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794 Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui	2024-03-30 13:04:38 +00:00
Yu, Guangye	f4ff063c33	Add attributes to xpu device prop (#121898 ) # Motivation Add some attributes to `XPUDeviceProp` and expose them via `torch.xpu.get_device_properties` and `torch.xpu.get_device_capability`. They can be used in `torch.compile` or directly passed to triton to generate more optimized code based on device properties. # Additional Context Expose the following attributes to `torch.xpu.get_device_properties`: - `has_fp16` (newly added) - `has_fp64` (newly added) - `has_atomic64` (newly added) - `driver_version` - `vendor` - `version` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121898 Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/malfet, https://github.com/albanD, https://github.com/atalman	2024-03-30 00:25:39 +00:00
William Wen	35382f0573	[dynamo, 3.12] Use CPython internal _PyOpcode_Caches instead of hardcoding (#122335 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122335 Approved by: https://github.com/jansel ghstack dependencies: #122146	2024-03-27 20:39:39 +00:00
Frank Lin	249e65b92d	Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 ) See #113541 The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality. cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068 Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/xuzhao9	2024-03-27 01:14:38 +00:00
rzou	c81c9ba472	Disallow {FakeTensor,FunctionalTensor}.data_ptr (#122514 ) This PR: - disallows FakeTensor.data_ptr when it is called inside PT2 or fx tracing. - disallows FunctionalTensor.data_ptr (python FunctionalTensor is only used in PT2) The motivation behind this is that the leading cause of segfaults when using custom ops with PT2 is calling .data_ptr on FunctionalTensor or FakeTensor. This change is BC-breaking. If your code broke as a result of this, it's because there was a bug in it (these .data_ptr should never be accessed!). You can either fix the bug (recommended) or get the previous behavior back with: ``` from torch._subclasses.fake_tensor import FakeTensor from torch._subclasses.functional_tensor import FunctionalTensor data_ptr = 0 if isinstance(tensor, (FakeTensor, FunctionalTensor)) else tensor.data_ptr() ``` Test Plan: - existing tests Differential Revision: [D55366199](https://our.internmc.facebook.com/intern/diff/D55366199) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122514 Approved by: https://github.com/ezyang, https://github.com/albanD, https://github.com/yifuwang, https://github.com/kurtamohler	2024-03-26 23:55:42 +00:00
Bin Bao	537cd66e73	[Inductor] Support custom op in JIT with cpp wrapper (#122554 ) Summary: To call custom ops in an ABI-compatible way requires doing boxed call with varargs across C shim. In the JIT mode, we can get around it by calling into Python. https://gist.github.com/desertfire/be2a65b0a9b47780bb716b53ac2cd2b3 is an example of generated code. Differential Revision: [D55326556](https://our.internmc.facebook.com/intern/diff/D55326556) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122554 Approved by: https://github.com/jansel, https://github.com/chenyang78	2024-03-26 18:48:45 +00:00
PyTorch MergeBot	4dc09d6aa4	Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 )" This reverts commit `e9dcda5cba`. Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))	2024-03-25 13:49:04 +00:00
Edward Z. Yang	5891c5b3a6	Factor meta conversion through serializable MetaTensorDesc (#122044 ) Fixes https://github.com/pytorch/pytorch/issues/121085 This PR is pretty involved, so pay attention to this description. At a high level, the refactor is intended to be mechanical: anywhere in MetaConverter where previously we took a Tensor as argument, we now take a MetaTensorDesc, which contains all of the information that we would have queried off of the Tensor, but placed into a separate data structure which we can serialize or use to recreate a fake tensor in a separate fake tensor mode in exact fidelity to the original. However, this transformation is not always entirely mechanical. Here is what you need to pay attention to: - The memo table from real Tensor -> meta/fake Tensor is now broken into two memo tables: real Tensor -> stable int id -> meta/fake Tensor. The stable int id is needed so that when we do serialization, we know when tensors/storages alias each other and can ensure we preserve this aliasing upon deserialization. The way I have implemented it changes the weak reference behavior. Previously, when either the real Tensor OR the meta/fake Tensor went dead, we would remove the entry from the memo table. Now, this only removes entries from one of the two memo tables. This semantically makes sense, because the user may have held on to the stable int id out of band, and may expect a real Tensor to continue to be numbered consistently / expect to be able to look up a meta/fake tensor from this id. If this is unacceptable, it may be possible to rejigger the memo tables so that we have real Tensor -> stable int id and real Tensor -> meta/fake Tensor, but TBH I find the new implementation a lot simpler, and arranging the memo tables in this way means that I have to muck around with the real tensor to save to the memo table; in the current implementation, I never pass the Tensor to meta_tensor function AT ALL, which means it is impossible to accidentally depend on it. 
- When I fill in the fields of MetaTensorDesc in describe_tensor, I need to be careful not to poke fields when they are not valid. Previously, preconditions were implicitly checked via the conditional structure ("is this sparse? is this nested?") that is tested before we start reading attributes. This structure has to be replicated in describe_tensor, and I have almost assuredly gotten it wrong on my first try (I'll be grinding through it on CI; a careful audit will help too, by auditing that I've tested all the same conditionals that the original access was guarded by.) - I originally submitted https://github.com/pytorch/pytorch/pull/121821 for the symbolic shapes change, but it turned out the way I did it there didn't actually work so well for this PR. I ended up just inlining the symbolic shapes allocation logic into MetaConverter (look for calls to maybe_specialize_sym_int_with_hint), maybe there is a better way to structure it, but what I really want is to just read sizes/strides/offset directly off of MetaTensorDesc; I don't want another intermediate data structure. - Some fields aren't serializable. These are documented as "NOT serializable". ctx/type should morally be serializable and I just need to setup a contract with subclasses to let them be serialized. The fake_mode is used solely to test if we are refakefying with a pre-existing ShapeEnv and we want to reuse the SymInt directly--serializing this case is hopeless but I am kind of hoping after this refactor we do not need this at all. view_func is not serializable because it's a bound C implemented method. Joel has promised me that this is not too difficult to actually expose as a true data structure, but this is the edgiest of edge cases and there is no reason to deal with it right now. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044 Approved by: https://github.com/eellison	2024-03-25 06:21:17 +00:00
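A minimal sketch of the two-level memo table described in the commit above (real Tensor -> stable int id -> meta/fake Tensor), written in plain Python. The class `TwoLevelMemo`, the `Obj` stand-in for a tensor, and the choice of `weakref.WeakKeyDictionary` / `weakref.WeakValueDictionary` are illustrative assumptions and are not taken from the actual `MetaConverter` code.

```python
import itertools
import weakref


class Obj:
    """Hypothetical stand-in for a tensor; plain objects support weak references."""

    def __init__(self, name):
        self.name = name


class TwoLevelMemo:
    """Illustrative only: real object -> stable int id -> fake object."""

    def __init__(self):
        self._next_id = itertools.count()
        # Table 1: an entry disappears when the *real* object dies.
        self._real_to_id = weakref.WeakKeyDictionary()
        # Table 2: an entry disappears when the *fake* object dies.
        self._id_to_fake = weakref.WeakValueDictionary()

    def get_id(self, real):
        # Assign a fresh id the first time a real object is seen.
        if real not in self._real_to_id:
            self._real_to_id[real] = next(self._next_id)
        return self._real_to_id[real]

    def set_fake(self, stable_id, fake):
        self._id_to_fake[stable_id] = fake

    def get_fake(self, stable_id):
        return self._id_to_fake.get(stable_id)


memo = TwoLevelMemo()
real = Obj("real")
fake = Obj("fake")
sid = memo.get_id(real)
memo.set_fake(sid, fake)

# The same real object always maps to the same id.
assert memo.get_id(real) == sid

# Dropping the real object only affects the first table; the fake object
# can still be looked up from an id that was held out of band.
del real
assert memo.get_fake(sid) is fake
```

Because each table holds only one weak side, the death of one object no longer invalidates the other mapping, which is the behavior change the commit message calls out.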
Guilherme Leobas	4eaa000acc	Teach dynamo about torch.func.jvp (#119926 ) List of changes: - Replace JVP_NESTING by torch._C._functorch.maybe_current_level() - Remove all increment nesting functions from wrap_fx_proxy_cls - fwAD.make_dual receives the dual_level as keyword argument - Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926 Approved by: https://github.com/zou3519	2024-03-22 20:25:47 +00:00
PyTorch MergeBot	f65373e278	Revert "Factor meta conversion through serializable MetaTensorDesc (#122044 )" This reverts commit `e2d89e9704`. Reverted https://github.com/pytorch/pytorch/pull/122044 on behalf of https://github.com/jeanschmidt due to Seems that some landrace caused this PR to break lint ([comment](https://github.com/pytorch/pytorch/pull/122044#issuecomment-2015025490))	2024-03-22 12:46:21 +00:00
Edward Z. Yang	e2d89e9704	Factor meta conversion through serializable MetaTensorDesc (#122044 ) Fixes https://github.com/pytorch/pytorch/issues/121085 This PR is pretty involved, so pay attention to this description. At a high level, the refactor is intended to be mechanical: anywhere in MetaConverter where previously we took a Tensor as argument, we now take a MetaTensorDesc, which contains all of the information that we would have queried off of the Tensor, but placed into a separate data structure which we can serialize or use to recreate a fake tensor in a separate fake tensor mode in exact fidelity to the original. However, this transformation is not always entirely mechanical. Here is what you need to pay attention to: - The memo table from real Tensor -> meta/fake Tensor is now broken into two memo tables: real Tensor -> stable int id -> meta/fake Tensor. The stable int id is needed so that when we do serialization, we know when tensors/storages alias each other and can ensure we preserve this aliasing upon deserialization. The way I have implemented it changes the weak reference behavior. Previously, when either the real Tensor OR the meta/fake Tensor went dead, we would remove the entry from the memo table. Now, this only removes entries from one of the two memo tables. This semantically makes sense, because the user may have held on to the stable int id out of band, and may expect a real Tensor to continue to be numbered consistently / expect to be able to look up a meta/fake tensor from this id. If this is unacceptable, it may be possible to rejigger the memo tables so that we have real Tensor -> stable int id and real Tensor -> meta/fake Tensor, but TBH I find the new implementation a lot simpler, and arranging the memo tables in this way means that I have to muck around with the real tensor to save to the memo table; in the current implementation, I never pass the Tensor to meta_tensor function AT ALL, which means it is impossible to accidentally depend on it. 
- When I fill in the fields of MetaTensorDesc in describe_tensor, I need to be careful not to poke fields when they are not valid. Previously, preconditions were implicitly checked via the conditional structure ("is this sparse? is this nested?") that is tested before we start reading attributes. This structure has to be replicated in describe_tensor, and I have almost assuredly gotten it wrong on my first try (I'll be grinding through it on CI; a careful audit will help too, by auditing that I've tested all the same conditionals that the original access was guarded by.) - I originally submitted https://github.com/pytorch/pytorch/pull/121821 for the symbolic shapes change, but it turned out the way I did it there didn't actually work so well for this PR. I ended up just inlining the symbolic shapes allocation logic into MetaConverter (look for calls to maybe_specialize_sym_int_with_hint), maybe there is a better way to structure it, but what I really want is to just read sizes/strides/offset directly off of MetaTensorDesc; I don't want another intermediate data structure. - Some fields aren't serializable. These are documented as "NOT serializable". ctx/type should morally be serializable and I just need to setup a contract with subclasses to let them be serialized. The fake_mode is used solely to test if we are refakefying with a pre-existing ShapeEnv and we want to reuse the SymInt directly--serializing this case is hopeless but I am kind of hoping after this refactor we do not need this at all. view_func is not serializable because it's a bound C implemented method. Joel has promised me that this is not too difficult to actually expose as a true data structure, but this is the edgiest of edge cases and there is no reason to deal with it right now. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122044 Approved by: https://github.com/eellison ghstack dependencies: #122018	2024-03-22 03:56:34 +00:00
PyTorch MergeBot	968c4c4154	Revert "Refactor gpu trace to be device-agnostic (#121794 )" This reverts commit `74deacbf31`. Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk `74deacbf31`, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))	2024-03-21 20:33:17 +00:00
Frank Lin	e9dcda5cba	Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 ) See #113541 The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality. cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068 Approved by: https://github.com/ezyang	2024-03-21 01:57:08 +00:00
Yu, Guangye	74deacbf31	Refactor gpu trace to be device-agnostic (#121794 ) # Motivation Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic so that it can be shared among the device backends. # Solution Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794 Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui	2024-03-21 01:52:58 +00:00
Michael Lazos	c20cf97366	Move some cudagraphs checks into C++ (#122251 ) Based off of https://github.com/pytorch/pytorch/pull/111094 This + cpp guards improves TIMM geomean optimizer performance by about 20% Pull Request resolved: https://github.com/pytorch/pytorch/pull/122251 Approved by: https://github.com/eellison	2024-03-21 01:02:23 +00:00
PyTorch MergeBot	0696db8202	Revert "Teach dynamo about torch.func.jvp (#119926 )" This reverts commit `17489784b6`. Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/peterbell10 due to broken mac jobs on main ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2010327997))	2024-03-20 18:34:43 +00:00
Guilherme Leobas	17489784b6	Teach dynamo about torch.func.jvp (#119926 ) List of changes: - Replace JVP_NESTING by torch._C._functorch.maybe_current_level() - Remove all increment nesting functions from wrap_fx_proxy_cls - fwAD.make_dual receives the dual_level as keyword argument - Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926 Approved by: https://github.com/zou3519	2024-03-20 13:09:19 +00:00
PyTorch MergeBot	36e5c1dcab	Revert "Teach dynamo about torch.func.jvp (#119926 )" This reverts commit `edd04b7c16`. Reverted https://github.com/pytorch/pytorch/pull/119926 on behalf of https://github.com/jeanschmidt due to lots of breakages in pull jobs, checking if reverting this one will help ([comment](https://github.com/pytorch/pytorch/pull/119926#issuecomment-2007915919))	2024-03-19 18:59:46 +00:00
PyTorch MergeBot	f9ed1c432d	Revert "Refactor gpu trace to be device-agnostic (#121794 )" This reverts commit `0ff1109e26`. Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/jeanschmidt due to Reverting to see if rocm trunk errors are related ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2007519408))	2024-03-19 15:40:26 +00:00
Guilherme Leobas	edd04b7c16	Teach dynamo about torch.func.jvp (#119926 ) List of changes: - Replace JVP_NESTING by torch._C._functorch.maybe_current_level() - Remove all increment nesting functions from wrap_fx_proxy_cls - fwAD.make_dual receives the dual_level as keyword argument - Add jvp_increment_nesting, set_fwd_grad_enabled and dual_level context managers to dynamo Pull Request resolved: https://github.com/pytorch/pytorch/pull/119926 Approved by: https://github.com/zou3519	2024-03-19 13:06:42 +00:00
Yu, Guangye	0ff1109e26	Refactor gpu trace to be device-agnostic (#121794 ) # Motivation Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic so that it can be shared among the device backends. # Solution Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794 Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui	2024-03-19 06:02:28 +00:00
Animesh Jain	8860c625ea	[dynamo][guards-cpp-refactor] Integrate cpp guard manager with CheckFnManager (#120726 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120726 Approved by: https://github.com/jansel	2024-03-19 03:11:31 +00:00
andrewor14	773ae817f7	Batch Norm Consolidation (#116092 ) Summary: This commit simplifies the existing decomposition hierarchy of batch norm ops by adding a single, backend-agnostic op: `batch_norm_with_update`. The existing hierarchy looks like: ``` aten.batch_norm -> aten._batch_norm_impl_index -> [ aten.native_batch_norm -> aten._native_batch_norm_legit (export only) -> _batch_norm_legit_cpu/cuda (kernels, export only) -> _batch_norm_cpu/cuda (kernels) ] OR [ aten.cudnn_batch_norm ] OR [ aten.miopen_batch_norm ] ``` Aside from complexity, an important problem with the above decomposition hierarchy is cuda numerics in export flows. We observed significantly worse convergence when training a mobilenetv2-like model when using the `_batch_norm_cuda` kernel instead of the `cudnn_batch_norm` kernel. This means users who export their models on CPU first then move the models to cuda later may silently see worse accuracies even when cudnn is installed, because they are using the worse kernel. This issue is summarized in https://github.com/pytorch/pytorch/issues/111384. Instead, the new hierarchy proposed by consolidating existing batch norm ops will look like: ``` aten.batch_norm -> aten.batch_norm_with_update -> [ _batch_norm_cpu (kernel) ] OR [ _batch_norm_cuda (kernel) ] OR [ cudnn_batch_norm (kernel) ] OR [ miopen_batch_norm (kernel) ] ``` The new op `batch_norm_with_update` hides backend implementation details and automatically picks the right kernel based on what is installed. This commit also adds the following variants to this op: ``` batch_norm_with_update_functional batch_norm_with_update.out batch_norm_no_update batch_norm_no_update.out batch_norm_backward ``` Note that this commit only adds this op and its variants, but does not actually change the decomps to produce these ops in the graph. This will be done after the 2 week FC window, and the ops used in the old stack are planned to be removed after the 6 month BC window. 
Test Plan: `OpInfo` tests for `batch_norm_with_update`. Reviewers: albanD, bdhirsh Subscribers: albanD, bdhirsh, supriyar Tasks: https://github.com/pytorch/pytorch/issues/111384 Differential Revision: [D54805279](https://our.internmc.facebook.com/intern/diff/D54805279) Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092 Approved by: https://github.com/bdhirsh, https://github.com/albanD	2024-03-18 21:01:30 +00:00
Bin Bao	8be80706b4	[AOTI] Add pybind for tensor_converter util functions (#121744 ) Differential Revision: [D54818716](https://our.internmc.facebook.com/intern/diff/D54818716) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121744 Approved by: https://github.com/chenyang78 ghstack dependencies: #121523, #121743	2024-03-14 22:20:51 +00:00
Zihua Wu	d62bdb087d	[Profiler] add missing field device_resource_id (#121480 ) Fixes #121479 Co-authored-by: Aaron Shi <enye.shi@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121480 Approved by: https://github.com/aaronenyeshi	2024-03-12 21:42:53 +00:00
PyTorch MergeBot	fd0dbcd891	Revert "Batch Norm Consolidation (#116092 )" This reverts commit `7b4f70eda5`. Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/osalpekar due to Causes build failure in //caffe2:aten-hip (AMD build) target. See [D54707318](https://www.internalfb.com/diff/D54707318) for more details, may require internal build system changes to resolve. ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1989542965))	2024-03-11 22:22:41 +00:00
andrewor14	7b4f70eda5	Batch Norm Consolidation (#116092 ) Summary: This commit simplifies the existing decomposition hierarchy of batch norm ops by adding a single, backend-agnostic op: `batch_norm_with_update`. The existing hierarchy looks like: ``` aten.batch_norm -> aten._batch_norm_impl_index -> [ aten.native_batch_norm -> aten._native_batch_norm_legit (export only) -> _batch_norm_legit_cpu/cuda (kernels, export only) -> _batch_norm_cpu/cuda (kernels) ] OR [ aten.cudnn_batch_norm ] OR [ aten.miopen_batch_norm ] ``` Aside from complexity, an important problem with the above decomposition hierarchy is cuda numerics in export flows. We observed significantly worse convergence when training a mobilenetv2-like model when using the `_batch_norm_cuda` kernel instead of the `cudnn_batch_norm` kernel. This means users who export their models on CPU first then move the models to cuda later may silently see worse accuracies even when cudnn is installed, because they are using the worse kernel. This issue is summarized in https://github.com/pytorch/pytorch/issues/111384. Instead, the new hierarchy proposed by consolidating existing batch norm ops will look like: ``` aten.batch_norm -> aten.batch_norm_with_update -> [ _batch_norm_cpu (kernel) ] OR [ _batch_norm_cuda (kernel) ] OR [ cudnn_batch_norm (kernel) ] OR [ miopen_batch_norm (kernel) ] ``` The new op `batch_norm_with_update` hides backend implementation details and automatically picks the right kernel based on what is installed. This commit also adds the following variants to this op: ``` batch_norm_with_update_functional batch_norm_with_update.out batch_norm_no_update batch_norm_no_update.out batch_norm_backward ``` Note that this commit only adds this op and its variants, but does not actually change the decomps to produce these ops in the graph. This will be done after the 2 week FC window, and the ops used in the old stack are planned to be removed after the 6 month BC window. 
Test Plan: `OpInfo` tests for `batch_norm_with_update`. Reviewers: albanD, bdhirsh Subscribers: albanD, bdhirsh, supriyar Tasks: https://github.com/pytorch/pytorch/issues/111384 Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092 Approved by: https://github.com/bdhirsh, https://github.com/albanD	2024-03-08 15:07:15 +00:00
Shengbao Zheng	60aaba4128	create function to get ProcessGroupNCCL uid (#121132 ) Summary: expose ProcessGroupNCCL uid Differential Revision: D54446056 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132 Approved by: https://github.com/aaronenyeshi	2024-03-07 18:34:38 +00:00
PyTorch MergeBot	b529c19bdf	Revert "Batch Norm Consolidation (#116092 )" This reverts commit `5680f565d5`. Reverted https://github.com/pytorch/pytorch/pull/116092 on behalf of https://github.com/jeffdaily due to broke ROCm, PR signal was clean but trunk was not, the merge should have been blocked but wasn't ([comment](https://github.com/pytorch/pytorch/pull/116092#issuecomment-1981373237))	2024-03-06 17:10:01 +00:00
PyTorch MergeBot	8087912622	Revert "[XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185 )" This reverts commit `0ab2ec3738`. Reverted https://github.com/pytorch/pytorch/pull/120185 on behalf of https://github.com/briancoutinho due to This PR contains a list search in '_parse_kineto_events()' that can lead to very high cost of running this post trace, training jobs getting stuck for mins ([comment](https://github.com/pytorch/pytorch/pull/120185#issuecomment-1980180774))	2024-03-06 06:39:51 +00:00
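The revert reason above points at a list search that runs inside a per-event loop. The actual code in `_parse_kineto_events()` is not shown in this log, so the sketch below is only a generic illustration, with hypothetical names (`link_by_list_scan`, `link_by_index`, and `(id, correlation_id)` tuples), of why such a pattern scales quadratically and how an index built once avoids it.

```python
def link_by_list_scan(events, kernels):
    """For each event, scan the whole kernel list: O(len(events) * len(kernels))."""
    links = {}
    for event_id, corr_id in events:
        for kernel_id, kernel_corr_id in kernels:
            if kernel_corr_id == corr_id:
                links[event_id] = kernel_id
                break
    return links


def link_by_index(events, kernels):
    """Build a dict once, then do O(1) lookups: O(len(events) + len(kernels))."""
    first_kernel_for = {}
    for kernel_id, kernel_corr_id in kernels:
        # setdefault keeps the first match, mirroring the `break` above.
        first_kernel_for.setdefault(kernel_corr_id, kernel_id)
    return {
        event_id: first_kernel_for[corr_id]
        for event_id, corr_id in events
        if corr_id in first_kernel_for
    }


# Hypothetical data: 1000 events, a kernel for every even correlation id.
events = [(f"op{i}", i) for i in range(1000)]
kernels = [(f"kernel{i}", i) for i in range(0, 1000, 2)]

# Both strategies produce the same links; only the cost differs.
assert link_by_list_scan(events, kernels) == link_by_index(events, kernels)
```

With traces containing very many events, the nested scan grows with the product of the two list lengths, which is consistent with the multi-minute stalls reported in the revert comment.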
Tugsbayasgalan Manlaibaatar	5680f565d5	Batch Norm Consolidation (#116092 ) Summary: This commit simplifies the existing decomposition hierarchy of batch norm ops by adding a single, backend-agnostic op: `batch_norm_with_update`. The existing hierarchy looks like: ``` aten.batch_norm -> aten._batch_norm_impl_index -> [ aten.native_batch_norm -> aten._native_batch_norm_legit (export only) -> _batch_norm_legit_cpu/cuda (kernels, export only) -> _batch_norm_cpu/cuda (kernels) ] OR [ aten.cudnn_batch_norm ] OR [ aten.miopen_batch_norm ] ``` Aside from complexity, an important problem with the above decomposition hierarchy is cuda numerics in export flows. We observed significantly worse convergence when training a mobilenetv2-like model when using the `_batch_norm_cuda` kernel instead of the `cudnn_batch_norm` kernel. This means users who export their models on CPU first then move the models to cuda later may silently see worse accuracies even when cudnn is installed, because they are using the worse kernel. This issue is summarized in https://github.com/pytorch/pytorch/issues/111384. Instead, the new hierarchy proposed by consolidating existing batch norm ops will look like: ``` aten.batch_norm -> aten.batch_norm_with_update -> [ _batch_norm_cpu (kernel) ] OR [ _batch_norm_cuda (kernel) ] OR [ cudnn_batch_norm (kernel) ] OR [ miopen_batch_norm (kernel) ] ``` The new op `batch_norm_with_update` hides backend implementation details and automatically picks the right kernel based on what is installed. This commit also adds the following variants to this op: ``` batch_norm_with_update_functional batch_norm_with_update.out batch_norm_no_update batch_norm_no_update.out batch_norm_backward ``` Note that this commit only adds this op and its variants, but does not actually change the decomps to produce these ops in the graph. This will be done after the 2 week FC window, and the ops used in the old stack are planned to be removed after the 6 month BC window. 
Test Plan: `OpInfo` tests for `batch_norm_with_update`. Reviewers: albanD, bdhirsh Subscribers: albanD, bdhirsh, supriyar Tasks: https://github.com/pytorch/pytorch/issues/111384 Co-authored-by: Tugsbayasgalan Manlaibaatar <tmanlaibaatar@fb.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116092 Approved by: https://github.com/bdhirsh, https://github.com/albanD	2024-03-06 04:50:46 +00:00
Scott Wolchok	cac36e232e	[PyTorch] Split StaticModule out of test_static_runtime (#121028 ) I want to use StaticModule in another (internal) test, so splitting it out. Differential Revision: [D54384817](https://our.internmc.facebook.com/intern/diff/D54384817/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121028 Approved by: https://github.com/suo	2024-03-05 23:14:07 +00:00
rzou	3ef0befdc9	Better error messages for impl_abstract_pystub (#120959 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120959 Approved by: https://github.com/drisspg	2024-03-04 15:24:36 +00:00
albanD	8cb4855d1e	Release the GIL in serialization when it is safe to do so (#120818 ) In particular this ensures we release the GIL when serializing: - PyBytes objects (this is how we get the pickle object) - Storage objects Other string-like objects keep the GIL, which is fine because we only use this for very small strings today (for endianness), so releasing the GIL is not important there. Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120818 Approved by: https://github.com/colesbury	2024-03-01 22:37:26 +00:00
Guilherme Leobas	491c2b4665	Let torch dynamo inline torch.func.grad (#118407 ) When dynamo sees torch.func.grad, it tries to inline all frames related to it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118407 Approved by: https://github.com/zou3519	2024-02-28 20:05:00 +00:00
Xunsong, Huang	0ab2ec3738	[XPU][Profiler] Add Logic To The Profiler For Processing XPU-backend Data (#120185 ) This pull request provides an update on recent advancements in the PyTorch profiler with regard to XPU backend support. Following the successful merge of a previous pull request #94502 that established a pathway for the XPU backend within PyTorch, we have now taken steps to enhance the profiler's capabilities for handling and displaying profile data directly related to the XPU backend. # Motivation The current pull request builds upon this foundation by refining the profiler's data processing scripts, particularly `profiler_util.py`, to accommodate XPU backend-specific profile data. The aim is to align the handling and presentation of this data with that of the CUDA backend, offering users a consistent experience across different device profiles. This includes generating outputs such as JSON files compatible with Chrome trace tooling, among other formats. # Principles 1. Minimal Impact: The modifications introduced should support XPU backend data with minimal disruption to the existing profiling scripts. 2. Consistency: Changes should maintain stylistic and functional consistency with existing `CUDA` and `privateuse1` pathways, ensuring no adverse effects on other logic paths. 3. Exclusivity: Ensure that the new XPU pathway does not interfere with or impede other pathways. # Solutions ### a. Pathway Identification: Introduction of a `use_xpu` flag within `torch.autograd.profiler.profile` interfaces to distinguish XPU-specific profiling. ### b. `use_device` Logic Revision: With the introduction of the XPU pathway, `use_device` no longer implies a binary relationship with `use_cuda`. Consequently, we have revised related logic to remove implicit assertions and establish independent device distinction. ### c. 
Kernel List Segregation: To accommodate the non-binary nature of device pathways, we have enabled kernel lists to identify specific device affiliations through separate list objects. ### d. Formatted Output: To ensure output consistency, we have employed code duplication and keyword substitution techniques to facilitate the formatting of XPU-related profile data. # Additional Enhancements ### a. Enumerations in `.pyi` Files: Added recognition items for `DeviceType` and `ProfilerActivity` specific to XPU. ### b. Correct DeviceType Returns: Revised `deviceTypeFromActivity` logic to accurately differentiate between device backends, even when they share common flags such as `libkineto::ActivityType::GPU_MEMCPY`. ### c. Bug Fixes in `cuda_corr_map`: Addressed a corner case where erroneous parent-child event relationships were formed due to shared function event identifiers. The solution involves refining `cuda_corr_map` processing to prevent a function event from being misidentified as both the linker and linkee. # Further Abstraction Looking forward, we acknowledge the potential for further abstraction in the codebase. The current changes necessitated by XPU support have highlighted opportunities for reducing redundancy by consolidating naming conventions and utilizing a singular `device` naming system that relies on `DeviceType` attributes or string flags for differentiation. This would involve significant refactoring to replace device-specific flags and variables. This topic needs further discussions about whether we could and when we should deprecate all those flags and variables named with `cuda`. # Next Pull Request The next pull request will be contingent on Kineto's adoption of Intel's forthcoming PTI-sdk library, which will enable direct usage of XPU-related tracers. Subsequent modifications to `libkineto_init()` will aim to endow PyTorch running on XPU backends with comprehensive profiling capabilities on XPU devices. 
We appreciate your attention to these enhancements and welcome any feedback or questions you may have regarding these developments. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120185 Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui	2024-02-28 17:50:32 +00:00
Isuru Fernando	f7e79299c7	register torch.return_types in torch.fx._pytree (#120027 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120027 Approved by: https://github.com/lezcano, https://github.com/zou3519, https://github.com/XuehaiPan ghstack dependencies: #119284	2024-02-23 21:52:42 +00:00
Isuru Fernando	c3496d50f0	Fix torch.return_types init signature (#119284 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119284 Approved by: https://github.com/peterbell10, https://github.com/XuehaiPan	2024-02-23 21:52:34 +00:00
soulitzer	312ce35c1f	Rename singleton int to nested int (#119661 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661 Approved by: https://github.com/ezyang	2024-02-16 19:21:17 +00:00
Yu, Guangye	8f9f12c068	Intel GPU Runtime Upstreaming for Device Allocator (#118091 ) # Motivation According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of the device `Allocator` dedicated to XPU to PyTorch. In parallel, following our design, we are preparing to generalize `Allocator`. # Design In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below. <p align="center"> <img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218"> </p> # Additional Context We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`. Besides these PRs, we plan to make the device `Allocator` device-agnostic through another PR. In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expandable segments and statistics. We will add these features back in the subsequent PR, which intends to generalize `Allocator`. The differences from CUDA: only key functionality is included, and the following are lacking: AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segment... Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD ghstack dependencies: #117611, #117619, #117734	2024-02-16 06:46:00 +00:00
Yu, Guangye	4dc75f9084	Intel GPU Runtime Upstreaming for Event (#117734 ) # Motivation As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event`, which tracks the status of an operation that is being executed. In some circumstances, `Event` gives us fine-grained control over operation execution. # Design `XPUEvent` is a movable but non-copyable wrapper around a SYCL event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, `XPUEvent` can wait for another `XPUEvent` or for all the kernels submitted on an `XPUStream` to complete. To align with the other backends, the C++ files related to `Event` will be placed in the `aten/src/ATen/xpu` folder. For frontend code, the `XPUEvent` runtime API will be bound to Python `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and the Python code in `torch/xpu/streams.py`. # Additional Context It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`. We will be adding support for it soon. `XPUEvent` also doesn't support IPC between different processes. For the other parts, we have almost a 1:1 mapping with CUDA. The following APIs are missing: - `torch.cuda.Event.ipc_handle` - `CUDAEvent`'s constructor with `IpcEventHandle` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #117611, #117619	2024-02-16 06:28:26 +00:00
Eddie Yan	cd380c794f	[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 ) #113713 Going to clean up some of the checks and will remove draft status after. Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`. CC @drisspg @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663 Approved by: https://github.com/drisspg	2024-02-14 22:02:06 +00:00
Guilherme Leobas	3319dbcd23	Update vmap guard to avoid recompilations (#119061 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119061 Approved by: https://github.com/zou3519	2024-02-13 20:50:23 +00:00
Shuqiang Zhang	893dcac068	[c10d] explicitly abort communicators in destroy_process_group call (#119250 ) Summary: This PR tries to resolve issue #119215. Process group shutdown (and hence ncclCommAbort) is now called in the destroy_process_group APIs for the corresponding PGs, and the destructor of ProcessGroup no longer calls abort/ncclCommAbort. Instead, the destructor only checks whether the user has already called destroy_process_group explicitly. If not, it logs a warning that encourages users to do so to clean up the PG's resources. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250 Approved by: https://github.com/minsii, https://github.com/kwen2501	2024-02-12 18:40:28 +00:00
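The shutdown contract described in #119250 can be sketched in plain Python. `FakeProcessGroup`, `destroy`, and `leak_warnings` are hypothetical names used only for illustration and are not the c10d implementation; the sketch only shows cleanup happening in an explicit call while the destructor merely warns.

```python
import warnings


class FakeProcessGroup:
    """Hypothetical stand-in for a process group (not the c10d class)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.destroyed = False

    def destroy(self) -> None:
        # Explicit cleanup path: this is where communicators would be aborted.
        self.destroyed = True

    def __del__(self) -> None:
        # The destructor does no cleanup itself; it only reports a missing call.
        if not self.destroyed:
            warnings.warn(
                f"{self.name} was not destroyed explicitly; call destroy() first",
                ResourceWarning,
            )


def leak_warnings(explicit: bool) -> int:
    """Return how many warnings are emitted when a group is dropped."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        pg = FakeProcessGroup("pg0")
        if explicit:
            pg.destroy()
        del pg  # CPython runs __del__ here because the refcount drops to zero
    return len(caught)
```

Calling `leak_warnings(False)` drops the group without the explicit call, which is the case the destructor is meant to flag.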
Yu, Guangye	8fd11cb307	[2/2] Intel GPU Runtime Upstreaming for Stream (#117619 ) # Motivation Following [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), and as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this second PR covers the changes under the `python frontend`. # Design Currently, it primarily offers stream-related APIs, including - `torch.xpu.StreamContext` - `torch.xpu.current_stream` - `torch.xpu.set_stream` - `torch.xpu.synchronize` - `torch._C._xpu_getCurrentRawStream` # Additional Context We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR, which is related to `Event`. The differences from CUDA: XPU has no default or external stream and lacks the following APIs: - `torch.cuda.ExternalStream` - `torch.cuda.default_stream` - `torch.cuda.is_current_stream_capturing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #117611	2024-02-10 03:39:42 +00:00
William Wen	ee1c2449f7	[dynamo] delete dynamo cache entry when guard function is invalidated [attempt 2] (#119107 ) Attempt #2 for https://github.com/pytorch/pytorch/pull/117875 to fix https://github.com/pytorch/pytorch/issues/112090. Summary of changes: - ~Changed CacheEntry linked list into a doubly-linked list structure to support deletion.~ (done by C++ refactor) - Added CacheEntry and ExtraState borrowed references to GuardFn so that GuardFn can tell ExtraState to delete CacheEntry when the GuardFn is invalidated. - ~Added ExtraState raw reference to CacheEntry so that we can get ExtraState to correctly point to the first CacheEntry if it gets deleted.~ (done by C++ refactor) - CacheEntry destructor needs to reset GuardFn refs to ExtraState/CacheEntry in order to prevent use-after-free. - code_context values that are nn.GraphModules need to be weakrefs in order to prevent circular references. - Added tests that check for memory leaks and cache deletion operations. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119107 Approved by: https://github.com/jansel	2024-02-07 03:32:42 +00:00
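The ownership scheme in #119107 (a guard that can ask its owning state to delete the cache entry it protects, and an entry that clears the guard's back-references on removal) can be sketched in plain Python. The class names mirror the commit message, but every method here (`add`, `remove`, `release`, `invalidate`) is a hypothetical simplification rather than the dynamo C++ code.

```python
from typing import Callable, List, Optional


class GuardFn:
    """Hypothetical guard holding borrowed references to its owners."""

    def __init__(self, check: Callable[[int], bool]) -> None:
        self.check = check
        self.cache_entry: Optional["CacheEntry"] = None
        self.extra_state: Optional["ExtraState"] = None

    def invalidate(self) -> None:
        # Ask the owning state to drop the entry that this guard protects.
        if self.extra_state is not None and self.cache_entry is not None:
            self.extra_state.remove(self.cache_entry)


class CacheEntry:
    def __init__(self, guard: GuardFn, code: str) -> None:
        self.guard = guard
        self.code = code

    def release(self) -> None:
        # Clear the back-references so the guard cannot reach removed state.
        self.guard.cache_entry = None
        self.guard.extra_state = None


class ExtraState:
    def __init__(self) -> None:
        self.entries: List[CacheEntry] = []

    def add(self, guard: GuardFn, code: str) -> CacheEntry:
        entry = CacheEntry(guard, code)
        guard.cache_entry = entry
        guard.extra_state = self
        self.entries.insert(0, entry)  # newest entry first
        return entry

    def remove(self, entry: CacheEntry) -> None:
        self.entries.remove(entry)
        entry.release()
```

Because `release` resets both references, a second `invalidate` on the same guard is a no-op instead of touching an entry that is already gone.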
William Wen	ae4e866bba	[dynamo] refactor CacheEntry and ExtraState to eval_frame.c to C++ (#118438 ) Part of implementing CacheEntry invalidation to fix https://github.com/pytorch/pytorch/issues/112090. Changes: - Move CacheEntry and ExtraState to C++ - Use pybind to control reference counting - Use std::list instead of manually implementing a linked list Pull Request resolved: https://github.com/pytorch/pytorch/pull/118438 Approved by: https://github.com/jansel	2024-02-06 20:48:11 +00:00
Yu, Guangye	a205e7bf56	[3/4] Intel GPU Runtime Upstreaming for Device (#116850 ) # Motivation According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this third PR covers the changes under `libtorch_python`. # Design This PR primarily offers device-related APIs in python frontend, including - `torch.xpu.is_available` - `torch.xpu.device_count` - `torch.xpu.current_device` - `torch.xpu.set_device` - `torch.xpu.device` - `torch.xpu.device_of` - `torch.xpu.get_device_name` - `torch.xpu.get_device_capability` - `torch.xpu.get_device_properties` - ==================== - `torch.xpu._DeviceGuard` - `torch.xpu._is_compiled` - `torch.xpu._get_device` # Additional Context We will implement the support of lazy initialization in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116850 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet	2024-02-01 12:31:26 +00:00
Michael Suo	eaa45f47f8	[sigmoid] fix for torchbind serialization (#118791 ) Summary: There is an annoying inconsistency in how we pickle custom objs. `torch.save` will invoke regular pickle, for which we have bound `__setstate__`/`__getstate__` methods on `torch.ScriptObject`: https://fburl.com/code/4howyl4u. This serializes in a different format than TorchScript does, which uses the TS C++ pickler. The issue we were facing was using the Python pickler to save, and the C++ pickler to load. If we use the C++ pickler to both save and load (plus some plumbing to get type/object resolution to work correctly), then things should work. Test Plan: ran SherlockNoMad's repro ``` buck2 run 'fbcode//mode/dev-nosan' scripts/bahuang:export_torchbind -- --logging DBG ``` Got to a new error, which has to do with how we're initializing the graph, but will leave that for future diffs. Reviewed By: SherlockNoMad Differential Revision: D53248454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118791 Approved by: https://github.com/qxy11, https://github.com/SherlockNoMad, https://github.com/khabinov	2024-02-01 10:09:07 +00:00
Yifu Wang	b778f44e97	Allow using native c10d_functional via _functional_collectives (#113057 ) This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification. NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057 Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol	2024-01-30 02:34:25 +00:00
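A minimal sketch of how an environment variable such as `_USE_NATIVE_C10D_FUNCTIONAL` can select between two coexisting implementations. `pick_backend` is a hypothetical helper, and treating only the value `"1"` as enabled is an assumption; the commit message does not say how the flag is parsed.

```python
import os
from typing import Callable, Mapping


def pick_backend(
    python_impl: Callable[[], str],
    native_impl: Callable[[], str],
    environ: Mapping[str, str] = os.environ,
) -> Callable[[], str]:
    """Choose an implementation based on the environment (hypothetical)."""
    # Assumption: the flag counts as enabled only when it is set to "1".
    if environ.get("_USE_NATIVE_C10D_FUNCTIONAL") == "1":
        return native_impl
    return python_impl
```

Passing the environment as a parameter keeps the selection easy to test without mutating `os.environ`.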
Colin Peppler	1f6aa4b336	[mypy] Enable follow_imports = normal for mypy-torch.backends.* (#116311 ) Summary: Test Plan: ``` lintrunner --take MYPYINDUCTOR --all-files ok No lint issues. lintrunner -a ok No lint issues. Successfully applied all patches. ``` Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/116311 Approved by: https://github.com/int3	2024-01-25 20:17:22 +00:00
Jeff Daily	01abb5af21	additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 ) Follow up to #107586. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214 Approved by: https://github.com/peterbell10, https://github.com/malfet	2024-01-22 18:33:41 +00:00
Guilherme Leobas	80cf0ce153	Enhance torch.vmap support from inside torch.compile (#116050 ) This work rewrites vmap support in torch.compile by inlining most of the frames into the existing FX graph. It also allows PyTorch to support features that were previously missing, such as keyword args. Fixes: https://github.com/pytorch/pytorch/issues/114306 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116050 Approved by: https://github.com/zou3519	2024-01-22 17:53:45 +00:00
PyTorch MergeBot	b637fdc8b3	Revert "additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 )" This reverts commit `74e1362499`. Reverted https://github.com/pytorch/pytorch/pull/115214 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/115214#issuecomment-1900815152))	2024-01-19 17:35:04 +00:00
Jeff Daily	74e1362499	additional support for float8_e4m3fnuz and _e5m2fnuz (#115214 ) Follow up to #107586. Pull Request resolved: https://github.com/pytorch/pytorch/pull/115214 Approved by: https://github.com/peterbell10	2024-01-19 00:50:18 +00:00
PyTorch MergeBot	2f84a9d37c	Revert "[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 )" This reverts commit `5aa92b5090`. Reverted https://github.com/pytorch/pytorch/pull/115663 on behalf of https://github.com/PaliC due to Unfortunately, this pr breaks cuda builds internally ([comment](https://github.com/pytorch/pytorch/pull/115663#issuecomment-1899388813))	2024-01-18 23:40:30 +00:00
Eddie Yan	5aa92b5090	[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 ) #113713 Going to clean up some of the checks and will remove draft status after. Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`. CC @drisspg @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663 Approved by: https://github.com/drisspg	2024-01-18 01:20:36 +00:00
FFFrog	7b0926cc3e	Fix wrong class inheritance in pyi (#116404 ) As the title stated. `f6dfbffb3b/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L153)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/116404 Approved by: https://github.com/ezyang, https://github.com/wconstab	2024-01-12 21:25:29 +00:00
Bin Bao	70f3a530d7	[AOTI] Add pybind for AOTIModelContainerRunnerCpu and AOTIModelContainerRunnerCuda (#116269 ) Summary: Now we can allocate an AOTIModelContainerRunner object instead of relying on torch.utils.cpp_extension.load_inline. Also renamed AOTInductorModelRunner to AOTIRunnerUtil in this PR. Test Plan: CI Reviewed By: khabinov Differential Revision: D52339116 Pull Request resolved: https://github.com/pytorch/pytorch/pull/116269 Approved by: https://github.com/khabinov	2024-01-04 18:58:24 +00:00
Aaron Gokaslan	bd10fea79a	[BE]: Enable F821 and fix bugs (#116579 ) Fixes #112371 I tried to fix as many of the bugs as I could, a few I could not figure out what the proper fix for them was though and so I left them with noqas. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579 Approved by: https://github.com/ezyang	2024-01-01 08:40:46 +00:00
angelayi	6b91e6907e	Add setUserEnabledNNPACK config (#116152 ) When exporting a model with a convolution kernel on cpu, if mkldnn is disabled and nnpack is enabled, export will go down the nnpack optimized convolution kernel for certain shapes ((code pointer)[`cd449e260c/aten/src/ATen/native/Convolution.cpp (L542-L552)`]). This means that we will automatically create a guard on that certain shape. If users want to export without any restrictions, one option is to disable nnpack. However, no config function exists for this, so this PR is adding a config function, similar to the `set_mkldnn_enabled` function. Original context is in https://fb.workplace.com/groups/1075192433118967/posts/1349589822345892/?comment_id=1349597102345164&reply_comment_id=1349677642337110. To test the flag, the following script runs successfully: ``` import os import torch from torchvision.models import ResNet18_Weights, resnet18 torch.set_float32_matmul_precision("high") model = resnet18(weights=ResNet18_Weights.DEFAULT) model.eval() with torch.no_grad(): # device = "cuda" if torch.cuda.is_available() else "cpu" torch.backends.mkldnn.set_flags(False) torch.backends.nnpack.set_flags(False) # <--- Added config device = "cpu" model = model.to(device=device) example_inputs = (torch.randn(2, 3, 224, 224, device=device),) batch_dim = torch.export.Dim("batch", min=2, max=32) so_path = torch._export.aot_compile( model, example_inputs, # Specify the first dimension of the input x as dynamic dynamic_shapes={"x": {0: batch_dim}}, # Specify the generated shared library path options={ "aot_inductor.output_path": os.path.join(os.getcwd(), "resnet18_pt2.so"), "max_autotune": True, }, ) ``` I'm not sure who to add as reviewer, so please feel free to add whoever is relevant! Pull Request resolved: https://github.com/pytorch/pytorch/pull/116152 Approved by: https://github.com/malfet	2023-12-27 06:00:16 +00:00
Nikita Shulga	0aa185f394	[BE] Make `torch.cuda.has_magma` a build time check (#116299 ) Perhaps originally one needed to query about GPU capability, but right now it's a simple check for a build time flag: `52f0457d7d/aten/src/ATen/cuda/detail/CUDAHooks.cpp (L165-L171)` Alternative, to avoid `at::hasMAGMA()` call one can implement it as follows: ```cpp const auto use_magma = caffe2::GetBuildOptions().at("USE_MAGMA"); return PyBool_FromLong(use_magma == "1"); ``` Make this check very similar to `_has_mkldnn` `0978482afa/torch/csrc/Module.cpp (L1793-L1794)` Test plan: Run `lldb -- python3 -c "import torch;print(torch.cuda.has_magma)"` and make sure it returns True and that `cuInit` is not called Pull Request resolved: https://github.com/pytorch/pytorch/pull/116299 Approved by: https://github.com/seemethere, https://github.com/albanD	2023-12-26 23:37:23 +00:00
fduwjj	f6dfbffb3b	[c10d] Add hashing as a debug feature for before and after NCCL collective call (#113238 ) For now, we use `TORCH_DISTRIBUTED_DEBUG = DETAIL` to turn on a debug feature that calculates hashes of the input tensors and output results of c10d collectives in NCCL. This is a debugging feature that lets us rule out bugs at the c10d level. <img width="840" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/cdc70b0b-ae3c-4efd-86ff-adc5c5ba505f"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/113238 Approved by: https://github.com/wconstab, https://github.com/fegin	2023-12-25 22:25:38 +00:00
zdevito	4afe2687d5	Reland "Serve multistream graph captures from correct pool (#114647 )" (#116199 ) Fixes a variable shadowing problem that broke internal builds. This reverts commit `fe15645619`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116199 Approved by: https://github.com/eellison	2023-12-20 21:22:34 +00:00
PyTorch MergeBot	fe15645619	Revert "Serve multistream graph captures from correct pool (#114647 )" This reverts commit `8a445f7bd5`. Reverted https://github.com/pytorch/pytorch/pull/114647 on behalf of https://github.com/jeanschmidt due to breaking multiple internal build jobs, please check internal diff in order to obtain more details ([comment](https://github.com/pytorch/pytorch/pull/114647#issuecomment-1864840724))	2023-12-20 17:11:42 +00:00
zdevito	8a445f7bd5	Serve multistream graph captures from correct pool (#114647 ) This fixes #114320 by placing the logic for determining whether to allocate to a pool inside a callback that is controlled by CUDAGraph.cpp or by the python bound api to allocate a stream directly to a pool. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114647 Approved by: https://github.com/ngimel, https://github.com/eellison	2023-12-18 18:24:15 +00:00
fduwjj	40ce9a4cfb	[c10d] Create a python c10d API _set_pg_timeout to set timeout (#115453 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115453 Approved by: https://github.com/wconstab, https://github.com/H-Huang	2023-12-12 20:52:43 +00:00
Nikita Shulga	b706c4116d	[MPS] Add MacOS 14 runtime check (#115512 ) Prerequisite for adding more complex type support and FFT operation Check using `conjugateWithTensor:name:` selector defined as follows ```objc /// Returns the complex conjugate of the input tensor elements. /// /// - Parameters: /// - tensor: The input tensor. /// - name: An optional string which serves as an identifier for the operation. /// - Returns: A valid `MPSGraphTensor` object containing the elementwise result of the applied operation. -(MPSGraphTensor *) conjugateWithTensor:(MPSGraphTensor *) tensor name:(NSString * _Nullable) name MPS_AVAILABLE_STARTING(macos(14.0), ios(17.0), tvos(17.0)) MPS_SWIFT_NAME( conjugate(tensor:name:) ); ``` - Rename `isOnMacOS13orNewer(unsigned minor)` hook to `isOnMacOSorNewer(major, minor)` - Replace `torch._C.__mps_is_on_macos_13_or_newer` with `torch._C._mps_is_on_macos_or_newer` - Add `torch.backends.mps.is_macos_or_newer` public API Pull Request resolved: https://github.com/pytorch/pytorch/pull/115512 Approved by: https://github.com/albanD	2023-12-11 21:11:42 +00:00
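The move in #115512 from a check hard-coded to macOS 13 to a `(major, minor)` check amounts to a tuple comparison. The sketch below is pure Python and assumes the current version is already known; `is_os_or_newer` is a hypothetical helper, not the `torch.backends.mps.is_macos_or_newer` implementation.

```python
from typing import Tuple


def is_os_or_newer(current: Tuple[int, int], major: int, minor: int = 0) -> bool:
    """Return True if `current` (major, minor) is at least (major, minor)."""
    # Tuples compare lexicographically: major first, then minor.
    return current >= (major, minor)
```

For example, `is_os_or_newer((13, 6), 14)` is false because the major version is compared before the minor one.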
Chip Turner	937d616e82	Re-enable type checking for distributed_c10d.py (#115223 ) Type checking for distributed_c10d.py was inadvertently turned off, and issues have accumulated since. Note: the backwards compatibility linter does not like some of these changes, but they were incorrect before. This needs human verification, however. #suppress-api-compatibility-check Pull Request resolved: https://github.com/pytorch/pytorch/pull/115223 Approved by: https://github.com/wconstab	2023-12-09 11:07:54 +00:00
albanD	a2b89154bf	New swap function (#111747 ) This PR is proposing a new approach to solve the nn/optim only linked by python object identity problem. The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references. This would allow us to swap the `model.weight` with a new Tensor (can be any subclass of Tensor and any TensorImpl (xla, sparse, nested tensorimpl would work)). The use within nn will be done in a follow up. This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs). Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots. The main limitation of this approach is that the number of slots need to match for the objects being swapped and thus limit usage of slots in subclasses. Draft right now to see what @colesbury thinks about doing this? Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747 Approved by: https://github.com/colesbury	2023-12-08 18:49:35 +00:00
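The idea behind #111747 (swap the contents of two objects while every existing reference keeps pointing at the same object) has a rough pure-Python analogy. This is not the PyObject-level swap the PR implements: `Weight`, `QuantizedWeight`, and `swap_contents` are hypothetical, and the sketch only works for plain classes without `__slots__`.

```python
class Weight:
    def __init__(self, data):
        self.data = data


class QuantizedWeight(Weight):
    def __init__(self, data, scale):
        super().__init__(data)
        self.scale = scale


def swap_contents(a: object, b: object) -> None:
    """Swap type and attributes of two plain objects, keeping identities."""
    # Both right-hand sides are read before anything is assigned.
    a.__class__, b.__class__ = b.__class__, a.__class__
    a.__dict__, b.__dict__ = b.__dict__, a.__dict__
```

After `swap_contents(w, q)`, a container that stored `w` still holds the same object, but that object now has the type and attributes that `q` had. The restriction to compatible object layouts loosely mirrors the slot limitation mentioned in the commit message.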
Pritam Damania	f505d76462	Bug fixes to DDP _update_process_group API. (#114194 ) https://github.com/pytorch/pytorch/pull/113580 introduced the `DDP._update_process_group` API. However, the implementation did not correctly reset all of the necessary state in the reducer. In particular if an error occurred during backward, DDP would end up in an incorrect state. As a result, in this PR I've enhanced the unit test to test for this case and also appropriately fixed resetting Reducer state. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114194 Approved by: https://github.com/rohan-varma	2023-11-27 23:52:40 +00:00
Antonio Kim	7fc292930c	Add support for `torch.Generator` type in TorchScript (#110413 ) - Add support for `torch.Generator` type in TorchScript - Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` - Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab) CC: @eellison @davidberard98 @GlebKazantaev @behzad-a Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98	2023-11-21 23:07:21 +00:00
Jez Ng	631fb33fd6	Enable import following in MYPYNOFOLLOW (now MYPYINDUCTOR) (#113830 ) Skipping importing some packages for now to make this change more tractable. For some reason, lintrunner on CI raises errors in all imported `.pyi` files, even though it doesn't on my local machine. The errors are all from missing generic types, as the MYPYINDUCTOR config has `disallow_any_generics` set. I have thus added `disable-error-code` comments to the relevant files, though I fixed a few that were easy enough. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113830 Approved by: https://github.com/Skylion007 ghstack dependencies: #113722, #113721	2023-11-17 18:24:21 +00:00
Jez Ng	0c8362de1a	[dynamo] Make {guards,eval_frame}.py pass follow_imports typechecking (#113721 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113721 Approved by: https://github.com/Skylion007 ghstack dependencies: #113722	2023-11-17 18:24:21 +00:00
Brian Vaughan	dbb96ef30d	improve annotation device parameters where a device ordinal is allowed (#113647 ) Using mypy in code that depends on pytorch, I noticed that the type annotation doesn't allow a device ordinal. `error: Argument "device" to "to_empty" of "Module" has incompatible type "int"; expected "str \| device" [arg-type]` Pull Request resolved: https://github.com/pytorch/pytorch/pull/113647 Approved by: https://github.com/albanD	2023-11-17 14:41:22 +00:00
Pritam Damania	17e2313dd3	Add an API to DDP for dynamically updating the underlying process group. (#113580 ) # Motivation If we would like to reinitialize DDP with a different PG with `torch.compile`, we need to do the following: ``` del old_ddp del old_pg pg = init_pg(...) ddp = DDP(pg) model = torch.compile(DDP) ``` This results in recompilation of the entire model and is very expensive. Since the only thing we need to update is the PG, we should be able to do this without having to compile the model again. # Proposal As a result, in this PR I've introduced an `_update_process_group` API which can dynamically update the underlying ProcessGroup used by DDP without needing to reinitialize DDP again. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113580 Approved by: https://github.com/fduwjj	2023-11-15 09:05:02 +00:00
drisspg	9b0f2f8d94	expose sdpa helpers to python (#110496 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110496 Approved by: https://github.com/jbschlosser	2023-11-15 07:34:34 +00:00
Jez Ng	fda94124d7	[inductor] Make {cudagraph_trees,decomposition,post_grad}.py pass follow_imports typechecking (#113609 ) I added explicit imports to `kernel/__init__.py` as mypy doesn't seem to understand an empty `__init__.py`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113609 Approved by: https://github.com/eellison	2023-11-15 05:04:11 +00:00
PyTorch MergeBot	252e68a83b	Revert "Add support for `torch.Generator` type in TorchScript (#110413 )" This reverts commit `54493fe8c4`. Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is, unfortunately, still breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1811625557))	2023-11-15 00:51:23 +00:00
Jez Ng	ffc3731dc4	Update TensorBase.to()'s signature; create {guards,compiled_autograd}.pyi (#113536 ) I had to explicitly import submodules in torch/_C/_dynamo/__init__.pyi because mypy doesn't seem to understand empty `__init__.py[i]` files. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113536 Approved by: https://github.com/ezyang ghstack dependencies: #113412, #113535	2023-11-14 04:31:12 +00:00
Antonio Kim	54493fe8c4	Add support for `torch.Generator` type in TorchScript (#110413 ) - Add support for `torch.Generator` type in TorchScript - Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` - Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab) CC: @eellison @davidberard98 @GlebKazantaev @behzad-a Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98	2023-11-13 23:18:14 +00:00
Jez Ng	6805d1e1d6	[inductor] Make graph.py pass follow_imports typechecking (#113518 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/113518 Approved by: https://github.com/Skylion007 ghstack dependencies: #113413	2023-11-11 22:15:46 +00:00
Jez Ng	a8cf04fd2a	[inductor] Make {output_graph,pad_mm}.py pass follow_imports typechecking (#113413 ) I changed OutputGraph.nn_modules' type to `Dict[str, Any]` because it seems that `register_attr_or_module` can populate it with essentially any type. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113413 Approved by: https://github.com/Skylion007	2023-11-11 22:15:46 +00:00
Jez Ng	a2c32b8bd0	[inductor] Make codegen/{common,wrapper,cuda/cutlass_utils}.py pass follow_imports typechecking (#113411 ) SymIntType is referenced by wrapper.py, so I added its .pyi definition. I also added SymBoolType along the way for completeness. The `isinstance` checks in wrapper.py reference torch.Type, which seems to cause mypy to choke. Not entirely sure why; I've just added type-ignore comments for now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113411 Approved by: https://github.com/Skylion007 ghstack dependencies: #113409, #113410	2023-11-10 19:58:08 +00:00
Jez Ng	a3a55df4af	[dynamo] Add .pyi declaration of _CacheEntry (#113305 ) This is required for enabling follow-imports=silent; referenced by _dynamo/types.py. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113305 Approved by: https://github.com/Skylion007, https://github.com/ezyang ghstack dependencies: #113304	2023-11-09 21:55:49 +00:00
Wes Bland	9d765d28ca	[pytorch] Add binding to get nccl version suffix (#112884 ) Summary: Adds a Python to C binding to get the NCCL_SUFFIX value for more accurate NCCL version information and add that to the NCCL version tuple. Differential Revision: D50978181 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112884 Approved by: https://github.com/kwen2501	2023-11-08 02:51:22 +00:00
Richard Zou	d1c092ae1b	Update impl_abstract_pystub to be less boilerplatey (#113182 ) Summary: We've made the following changes: - The new way to use the API is `m.impl_abstract_pystub(module, context)`. Every subsequent m.def of an op inside the TORCH_LIBRARY block gives the op the `impl_abstract_pystub`. - Added a mechanism to determine if an operator was defined in Python or C++. Library.define in Python appends the op to a global set, which is analogous to what we do for tracking Library.impl. - If someone does `torch.library.impl_abstract` in Python for an operator, then we require that it has an `impl_abstract_pystub` specified and we also check that the module in the `impl_abstract_pystub` is the same as the module where the call to `torch.library.impl_abstract` exists. - Unfortunately we can't check the "context" (which is the buck target on buck-based systems) because buck sits above us. bypass-github-export-checks Test Plan: - existing tests Differential Revision: D51080493 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113182 Approved by: https://github.com/ezyang	2023-11-08 00:39:00 +00:00
PyTorch MergeBot	bc3e2e03cd	Revert "Update impl_abstract_pystub to be less boilerplatey (#112851 )" This reverts commit `6ae4e3a8d2`. Reverted https://github.com/pytorch/pytorch/pull/112851 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/112851#issuecomment-1799539354))	2023-11-07 18:53:13 +00:00
Richard Zou	6ae4e3a8d2	Update impl_abstract_pystub to be less boilerplatey (#112851 ) Summary: We've made the following changes: - The new way to use the API is `m.impl_abstract_pystub(module, context)`. Every subsequent m.def of an op inside the TORCH_LIBRARY block gives the op the `impl_abstract_pystub`. - Added a mechanism to determine if an operator was defined in Python or C++. Library.define in Python appends the op to a global set, which is analogous to what we do for tracking Library.impl. - If someone does `torch.library.impl_abstract` in Python for an operator, then we require that it has an `impl_abstract_pystub` specified and we also check that the module in the `impl_abstract_pystub` is the same as the module where the call to `torch.library.impl_abstract` exists. - Unfortunately we can't check the "context" (which is the buck target on buck-based systems) because buck sits above us. Test Plan: - existing tests Differential Revision: D50972148 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112851 Approved by: https://github.com/ezyang	2023-11-07 16:07:42 +00:00
PyTorch MergeBot	9a28a7b498	Revert "Add support for `torch.Generator` type in TorchScript (#110413 )" This reverts commit `27e31ab6e8`. Reverted https://github.com/pytorch/pytorch/pull/110413 on behalf of https://github.com/PaliC due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/110413#issuecomment-1799003164))	2023-11-07 15:53:32 +00:00
Ke Wen	bb7ac12cbf	[ProcessGroupNCCL] Avoid recording stream for broadcast and scatter (#112896 ) Summary: Follows PR #111431, save memory for DTensor init Test Plan: Sandcastle Reviewed By: wanchaol Differential Revision: D50985365 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112896 Approved by: https://github.com/wanchaol	2023-11-07 15:44:04 +00:00
Will Constable	ff51f94e32	[Reland] Fix default timeouts for python entrypoints (e.g. init_process_group) (#113094 ) Previous PRs changed the c++ default timeout for PGNccl, but this path was only hit in some cases, and the python defaults took over in other cases. This PR ensures that NCCL pg always default to the changed NCCL-specific timeout value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113094 Approved by: https://github.com/fduwjj	2023-11-07 05:34:26 +00:00
PyTorch MergeBot	75adb9f371	Revert "Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893 )" This reverts commit `f9d47e1381`. Reverted https://github.com/pytorch/pytorch/pull/112893 on behalf of https://github.com/clee2000 due to sorry this seems to have broken inductor `f9d47e1381` https://github.com/pytorch/pytorch/actions/runs/6776367936/job/18418174752 ([comment](https://github.com/pytorch/pytorch/pull/112893#issuecomment-1796979811))	2023-11-06 22:49:53 +00:00
Antonio Kim	27e31ab6e8	Add support for `torch.Generator` type in TorchScript (#110413 ) - Add support for `torch.Generator` type in TorchScript - Add `generator` args to all `torch.nn.init` functions that call `uniform_` or `normal_` - Add support for `torch.Generator` in LTC's TorchScript backend (CC: @wconstab) CC: @eellison @davidberard98 @GlebKazantaev @behzad-a Pull Request resolved: https://github.com/pytorch/pytorch/pull/110413 Approved by: https://github.com/wconstab, https://github.com/albanD, https://github.com/glebk-cerebras, https://github.com/davidberard98	2023-11-06 21:27:02 +00:00
Will Constable	f9d47e1381	Fix default timeouts for python entrypoints (e.g. init_process_group) (#112893 ) Previous PRs changed the c++ default timeout for PGNccl, but this path was only hit in some cases, and the python defaults took over in other cases. This PR ensures that NCCL pg always default to the changed NCCL-specific timeout value. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112893 Approved by: https://github.com/xw285cornell, https://github.com/kwen2501, https://github.com/XilunWu ghstack dependencies: #112611, #112803	2023-11-06 20:48:39 +00:00
Jez Ng	b8ac5bbcbd	[dynamo] Enable typechecking for bytecode_transformation.py (#112561 ) As part of this diff, I have upgraded the `python_version` config setting to 3.11. `bytecode_transformation.py` (and a few other files) have functions using APIs only available in Python 3.11+. Those APIs are gated by a sys.version_info check in their typeshed .pyi files. So setting the min version to 3.11 allows those functions to typecheck properly. An alternative is to make the relevant types Any: ``` if sys.version_info >= (3, 11): _Positions = dis.Positions else: _Positions = Any ``` However, with python_version = 3.8, that means we're not getting any useful typechecking signal when encountering values of type _Position. Changing the python_version to 3.11 does mean that we will stop typechecking codepaths that run only on lower versions, but that seems a small price to pay. It does also mean that we won't catch code that uses newer APIs without the appropriate version check, but again, not sure this has much of an impact. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112561 Approved by: https://github.com/ezyang	2023-11-04 19:36:27 +00:00
isdanni	43fb5147e2	[BE] Enable Ruff's Flake8 PYI001 (#112823 ) Enable [unprefixed-type-param (PYI001)](https://docs.astral.sh/ruff/rules/unprefixed-type-param/#unprefixed-type-param-pyi001) Link: #110950 Pull Request resolved: https://github.com/pytorch/pytorch/pull/112823 Approved by: https://github.com/Skylion007	2023-11-03 17:25:39 +00:00
Yifu Wang	ac9476ba99	Add .boxed() to c10d::ProcessGroup and c10d::Work's pybind (#111997 ) Summary: When passed from C++ to Python, `c10d::ProcessGroup` and `c10d::Work` are automatically converted to their pybind class which can't be used for dispatcher ops. `.boxed()` exposes `c10d::ProcessGroup` and `c10d::Work` as boxed custom class object to Python. ```python import tempfile import torch import torch.distributed as dist if __name__ == "__main__": with tempfile.NamedTemporaryFile(delete=False) as tmpf: dist.init_process_group( backend="nccl", init_method=f"file://{tmpf.name}", rank=0, world_size=1 ) group = dist.group.WORLD print(group) print(group.boxed()) ``` ``` <torch.distributed.distributed_c10d.ProcessGroup object at 0x7fe42fb78d30> ScriptObject <__torch__.torch.classes.c10d.ProcessGroup> ``` Test Plan: Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/111997 Approved by: https://github.com/lw	2023-11-02 20:35:20 +00:00
Kurt Mohler	fd209543d5	Add `torch.utils.deterministic.fill_uninitialized_memory` flag (#111377 ) Part of #109802 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377 Approved by: https://github.com/albanD, https://github.com/aaronenyeshi	2023-11-01 16:10:09 +00:00
Jithun Nair	333d5821ee	[ROCm] Add gcnArchName to collect_env and torch.cuda.get_device_properties (#107477 ) Printing just the device name is not helpful when investigating PyTorch issues filed for specific AMD GPUs, as the support/issue might depend on the gfx arch, which is part of the gcnArchName property. `torch.cuda.get_device_properties(0).gcnArchName` will print the value of the `gcnArchName` property: eg. ``` >>> torch.cuda.get_device_properties(0).gcnArchName 'gfx906:sramecc+:xnack-' ``` ``` root@6f064e3c19fb:/data/pytorch/test# python ../torch/utils/collect_env.py ... GPU models and configuration: AMD Radeon Graphics(gfx906:sramecc+:xnack-) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/107477 Approved by: https://github.com/albanD	2023-10-31 23:05:36 +00:00
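Note on #107477: a minimal sketch of how a `gcnArchName` string like the one shown above might be split into the gfx arch and its feature flags. The function `parse_gcn_arch_name` is hypothetical, and reading a trailing `+` or `-` as enabled or disabled is an assumption, not something the commit message states.

```python
from typing import Dict, Tuple


def parse_gcn_arch_name(name: str) -> Tuple[str, Dict[str, bool]]:
    """Split 'arch:feat+:feat-' into the arch and a feature -> enabled map."""
    arch, *features = name.split(":")
    flags: Dict[str, bool] = {}
    for feature in features:
        if feature.endswith("+"):
            flags[feature[:-1]] = True
        elif feature.endswith("-"):
            flags[feature[:-1]] = False
        # Entries without a +/- suffix are ignored in this sketch.
    return arch, flags


# Sample value taken from the commit message above.
arch, flags = parse_gcn_arch_name("gfx906:sramecc+:xnack-")
print(arch, flags)  # gfx906 {'sramecc': True, 'xnack': False}
```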
PyTorch MergeBot	ace2713d1e	Revert "Add `torch.utils.deterministic.fill_uninitialized_memory` flag (#111377 )" This reverts commit `f1785373c0`. Reverted https://github.com/pytorch/pytorch/pull/111377 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111377#issuecomment-1784179040))	2023-10-29 17:41:55 +00:00
dshi7	fbff99ffea	Add regex matching to Inductor all2all collective unit tests (#112077 ) Fixes #111776 Support check_regex in FileCheck() by adding `find_regex` in `struct TORCH_API StringCordView`. Callsite accepts RE syntax for std::regex. However, I haven't figured out submatch ID yet. For example, "buf5[0], buf6_inputs[0]" is still considered a match. Pull Request resolved: https://github.com/pytorch/pytorch/pull/112077 Approved by: https://github.com/yf225	2023-10-26 08:29:30 +00:00
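Note on #112077: the submatch problem described above can be illustrated with a backreference. The sketch below uses Python's `re` rather than `std::regex`, and both patterns are hypothetical; it also assumes the intent is for the two buffer ids to be equal.

```python
import re

line = "buf5[0], buf6_inputs[0]"

# Two independent groups: any pair of buffer ids is accepted.
loose = re.compile(r"buf(\d+)\[0\], buf(\d+)_inputs\[0\]")

# Backreference \1: the second id must repeat the first captured id.
strict = re.compile(r"buf(\d+)\[0\], buf\1_inputs\[0\]")

print(bool(loose.search(line)))                          # True
print(bool(strict.search(line)))                         # False
print(bool(strict.search("buf5[0], buf5_inputs[0]")))    # True
```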
Kurt Mohler	f1785373c0	Add `torch.utils.deterministic.fill_uninitialized_memory` flag (#111377 ) Part of #109802 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377 Approved by: https://github.com/albanD	2023-10-26 02:39:06 +00:00
Jez Ng	ad3572a5dc	Unify torch.SymInt and torch.types.SymInt (#110573 ) Per @ezyang, this should be fine Pull Request resolved: https://github.com/pytorch/pytorch/pull/110573 Approved by: https://github.com/ezyang	2023-10-24 16:17:23 +00:00
Richard Zou	66b74d231a	Change torch.library.impl to accept a device string (#111659 ) torch.library.impl now accepts a device string (e.g. "cpu", "cuda"). It still accepts DispatchKey strings, but we no longer document this, because using arbitrary DispatchKeys is more for the power users. We map the device string to a DispatchKey and then register the impl for said DispatchKey. A user may also specify multiple device strings at once or specify "types=default" to get a CompositeExplicitAutograd registration. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/111659 Approved by: https://github.com/soulitzer ghstack dependencies: #111380	2023-10-23 23:02:41 +00:00
Ke Wen	18cc8a92ac	[ProcessGroupNCCL] Avoid recording stream for synchronous ops (#111431 ) For synchronous ops (i.e. `asyncOp = False`), we don't want to record streams because we know that the NCCL stream will join back to the "current" stream right after this op. So we might just as well keep the stream ownership of the input/output tensors unchanged. The benefit would be that the allocation/free of the tensors would look deterministic to the "current" stream so that the caching allocator can reuse memory pool for this stream in a clever way. To prevent the input/output tensors from being recycled by python, we rely on the stashing mechanism in ProcessGroupNCCL (which can be also turned on by setting `TORCH_NCCL_AVOID_RECORD_STREAMS=1`). This mechanism change is for libraries like FSDP which uses `all_gather_into_tensor` and `reduce_scatter_tensor` in a synchronous way and which cannot set `TORCH_NCCL_AVOID_RECORD_STREAMS=1` for their users. And therefore, this change is limited to these two collectives for now. Cc: @awgu @janeyx99 @albanD Pull Request resolved: https://github.com/pytorch/pytorch/pull/111431 Approved by: https://github.com/H-Huang	2023-10-19 00:41:09 +00:00
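Note on #111431: a minimal sketch of opting in to the stashing mechanism through the environment variable named above. It assumes the variable is read when the NCCL process group is created, so it is set before any `init_process_group` call; the helper `avoid_record_streams_enabled` is hypothetical.

```python
import os

# Assumption: ProcessGroupNCCL reads this when the process group is created,
# so set it before calling torch.distributed.init_process_group(...).
os.environ["TORCH_NCCL_AVOID_RECORD_STREAMS"] = "1"


def avoid_record_streams_enabled() -> bool:
    """Hypothetical helper: check whether the opt-in flag is set to '1'."""
    return os.environ.get("TORCH_NCCL_AVOID_RECORD_STREAMS") == "1"


print(avoid_record_streams_enabled())  # True
```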
Nikita Shulga	f84755bcac	Fix _CudaStreamBase type annotations (#111387 ) Make it inherit from `Stream` as indeed it is, see `97a513ed07/torch/csrc/cuda/Stream.cpp (L208)` and ``` python3 -c "import torch;print(torch._C._CudaStreamBase.__base__)" <class 'torch.Stream'> ``` Fixes https://github.com/pytorch/pytorch/issues/111268 TODO (in separate PR): Revive `test_typing` and add regression test Pull Request resolved: https://github.com/pytorch/pytorch/pull/111387 Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007	2023-10-16 23:26:58 +00:00
PyTorch MergeBot	1e70f4d02c	Revert "Reland #2 "[C10] PG observability hooks. (#108815 , #110907 )" (#111072 )" This reverts commit `bb1424d46e`. Reverted https://github.com/pytorch/pytorch/pull/111072 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111072#issuecomment-1765399829))	2023-10-16 23:03:26 +00:00
Jesse Cai	4c01686027	Public API for constructing NT with jagged layout from tensor list (#111078 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/111078 Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer ghstack dependencies: #109123	2023-10-13 03:27:41 +00:00

1178 Commits