Commit Graph

332 Commits

Author SHA1 Message Date
Chien-Chin Huang
e72936c27c [PT2D] Fix the circular import issue (#125618)
As title

Differential Revision: [D57011394](https://our.internmc.facebook.com/intern/diff/D57011394/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125618
Approved by: https://github.com/wz337
2024-05-07 05:10:18 +00:00
Muralidhar Andoorveedu
b96b1e8cff [Distributed] Add P2P versions of *object_list operations (#124379)
This PR adds `send_object_list` and `recv_object_list` to `distributed_c10d.py`. This is extending functionality already present in PyTorch with `broadcast_object_list` that I noticed was missing and decided to upstream.

With this change, sending and receiving arbitrary picklable Python objects is possible.

Relevant issue: https://github.com/pytorch/pytorch/issues/3473
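A minimal usage sketch (assuming the new ops mirror `broadcast_object_list`'s calling convention and the script is launched with torchrun on 2 ranks):

```
import torch.distributed as dist

# Sketch: send/receive arbitrary picklable Python objects between two ranks.
dist.init_process_group("gloo")  # env (MASTER_ADDR, RANK, ...) set by torchrun

if dist.get_rank() == 0:
    objects = [{"step": 1}, "hello", 3.14]   # arbitrary picklable objects
    dist.send_object_list(objects, dst=1)
else:
    objects = [None, None, None]             # pre-sized list, filled in place
    dist.recv_object_list(objects, src=0)
```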

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124379
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-05-03 23:22:58 +00:00
Chien-Chin Huang
1eb7b8eb60 [PT2D] Ensure the trace rules are correct with distributed (#125333)
Summary:
1. Avoid using `torch._dynamo.disable`.
2. Clear the LRU cache of the trace rules. This won't do anything if rules are not evaluated before PG initialization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125333
Approved by: https://github.com/yanboliang
2024-05-02 16:28:38 +00:00
feifan
197612c84c ProcessGroupWrapper support custom backend (#124447)
Fixes #ISSUE_NUMBER
In the current code, ProcessGroupWrapper only works for the `GLOO, NCCL, UCC` backends when `TORCH_DISTRIBUTED_DEBUG=DETAIL`.
Reading the ProcessGroupWrapper code, I found that a communication op in ProcessGroupWrapper is just the communication op of the original backend plus runCollectiveChecks in gloo, e.g. allreduce:
82e0153487/torch/csrc/distributed/c10d/ProcessGroupWrapper.cpp (L406-L411)

`runCollectiveChecks` computes a collective fingerprint of the tensors and runs gloo's `monitoredBarrier`.
82e0153487/torch/csrc/distributed/c10d/ProcessGroupWrapper.cpp (L586-L590)
I don't know why ProcessGroupWrapper doesn't work for all backends, but I think custom backends can support it.
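A hedged sketch of how the wrapper is exercised (assumes a 2-rank job launched with torchrun, one GPU per rank; wrapping of custom backends is what this PR enables):

```
import torch
import torch.distributed as dist

# Sketch: with TORCH_DISTRIBUTED_DEBUG=DETAIL exported before launch, the
# process group is wrapped in ProcessGroupWrapper, which runs the gloo-based
# checks (collective fingerprint + monitoredBarrier) before each collective.
dist.init_process_group("nccl")

t = torch.ones(4, device=f"cuda:{dist.get_rank()}")
dist.all_reduce(t)  # mismatched shapes/dtypes/ops across ranks are reported early
```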

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124447
Approved by: https://github.com/kwen2501
2024-05-01 19:59:55 +00:00
Will Constable
8f31988088 [C10D] Document 'tag' limitation for nccl send/recv (#125278)
Existing documentation on isend/irecv also applies to send/recv. This PR
copies the doc/warning to send/recv ops as well.

Note: a tag may be supplied, but it will be ignored when used with the nccl
backend.
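A hedged sketch (assumes a 2-rank NCCL process group is already initialized, one GPU per rank):

```
import torch
import torch.distributed as dist

# Sketch: tag is accepted by send/recv but silently ignored by the NCCL
# backend, so it cannot be used to match messages there.
if dist.get_rank() == 0:
    dist.send(torch.ones(2, device="cuda:0"), dst=1, tag=42)  # tag ignored on nccl
else:
    buf = torch.empty(2, device="cuda:1")
    dist.recv(buf, src=0, tag=42)
```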

Fixes #94819 #125079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125278
Approved by: https://github.com/kwen2501
2024-05-01 02:53:30 +00:00
Aaron Gokaslan
5a1216bb2e [BE]: Update ruff to 0.4.1 (#124549)
Update ruff to 0.4.1.
This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes.

Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0

| Repository                                         | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7         | 251.8         | 351.1            | 274.9            |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
2024-04-21 14:06:23 +00:00
Kiuk Chung
87f44d70b1 [torch/distributed] Check gloo availability when doing isinstance(pg,… (#124233)
Fixes a bug where a reference to `_ProcessGroupWrapper` is used without first checking whether gloo is available. This fails on pytorch builds that do not include gloo because `_ProcessGroupWrapper` is only pybinded when building with `USE_GLOO=1`. Therefore, creation of a new process group fails with a `NameError` when only NCCL is available as the backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124233
Approved by: https://github.com/rohan-varma, https://github.com/d4l3k
2024-04-19 04:07:00 +00:00
Shuqiang Zhang
ca6a0e1348 [c10d] remove the env of TORCH_NCCL_ABORT_IN_DESTROY_PG (#124334)
Summary:
This env var was introduced to safely roll out the behavior change in destroy
process group (e.g., calling ncclCommAbort). Now that this behavior change
has been rolled out, we no longer need the env var and should clean it up
to keep the code cleaner.
Test Plan:
Modified/existing ut pass

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334
Approved by: https://github.com/wconstab
2024-04-18 23:42:55 +00:00
Will Constable
1885c3972d [C10D] Add dist.get_node_local_rank helper (#123992)
Fixes #122816

Summarizing the pros/cons of the request and motivation from #122816

- (+) it's really common for users to read `os.environ["LOCAL_RANK"]`, so we
  should provide a helper
- (-) we can't really control if/how local rank information is made
  available, but it is handled automatically if torchrun is used.

We can assume the local rank is correct when it is passed.
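A minimal sketch of the intended usage (the `fallback_rank` keyword shown here is an assumption for the no-torchrun case, not confirmed by this log):

```
import torch
import torch.distributed as dist

# Sketch: read the node-local rank via the helper instead of poking at
# os.environ["LOCAL_RANK"]; fallback_rank is an assumed parameter for the
# single-process case.
local_rank = dist.get_node_local_rank(fallback_rank=0)
torch.cuda.set_device(local_rank)
```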

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123992
Approved by: https://github.com/shuqiangzhang, https://github.com/zdevito, https://github.com/XilunWu
2024-04-16 00:09:46 +00:00
Shengbao Zheng
9fa922c2ed [profiler] Log process group name instead of pg uid (#124035)
Summary:
As part of the work to unify process group identifiers, log <group_name, group_desc> instead of the pg uid in the profiler.
- group_name remains the unique identifier, e.g. “0”, "1"
- group_desc will be the user-specified name, e.g. "fsdp".

Reviewed By: aaronenyeshi, kwen2501

Differential Revision: D55610682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035
Approved by: https://github.com/aaronenyeshi
2024-04-15 21:49:06 +00:00
Shengbao Zheng
4e9094533e [c10d/nccl-pg] allow user to pass process group description (#123472)
Summary:
We need a way to allow users to set a customized description for a process group, e.g. FSDP, PP.

Here are several use cases for a user-specified group_desc:
- Logging: we can easily match a log line and understand what a collective/PG is used for.
- PyTorch traces (e.g. Kineto, Execution Trace) can benefit from the PG desc, since trace analysis and benchmarks will be able to easily differentiate PG purposes like FSDP and PP.
- Lower-layer collective (e.g. NCCL) debugging: we will be able to expose the PG desc to the NCCL communicator so NCCL-layer operations can be easily correlated to a PG.

Solution: Add a group_desc field to c10d
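A hedged sketch of what this could look like from Python (`group_desc` as a `new_group` keyword is assumed from the PR description, not confirmed here):

```
import torch.distributed as dist

# Sketch: attach a human-readable description to a sub-group; it then shows up
# in logs, traces, and NCCL-level debug output alongside the numeric group name.
dist.init_process_group("nccl")
fsdp_pg = dist.new_group(ranks=list(range(dist.get_world_size())), group_desc="fsdp")
```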

Differential Revision: D55781850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472
Approved by: https://github.com/kwen2501
2024-04-12 08:44:21 +00:00
Gufan Yin
65710d95c9 Fix example in torch.distributed.new_subgroups docstring (#123492)
Summary: As title

Test Plan: Run the example locally

Reviewed By: zhaojuanmao

Differential Revision: D55617871

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123492
Approved by: https://github.com/wconstab, https://github.com/wz337
2024-04-10 03:33:07 +00:00
Shengbao Zheng
ae6f8d923c Pass and record process_group_name when creating ProcessGroupNCCL (#123117)
Summary:
Pass the Python c10d group_name to the C++ ProcessGroupNCCL so that the pg name is consistent across different layers.
Also record pg_name in flight recorder entry.

Differential Revision: D55597200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123117
Approved by: https://github.com/wconstab
2024-04-05 18:57:45 +00:00
Chirag Pandya
b6201a60c5 [BE] minor logging cleanup in distributed (#122921)
Summary:
    Minor logging cleanup in distributed library
    1. Don't use "f" formatted strings - address linter issues.
    2. Nits: Make use of unused `e` (error) in a few logs.
    3. Change info->debug as asked in issue #113545
    4. Nit: rename log -> logger in a few files for consistency
    5. Fix a linter error.

    Test Plan:
    1. Local build passes.
    2. Linter is happy.

    Reviewers: wanchaol

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122921
Approved by: https://github.com/wanchaol
2024-03-29 03:34:01 +00:00
Yifu Wang
36188360dd [dynamo] support torch.distributed.{group.WORLD, GroupMember.WORLD, distributed_c10d._get_default_group} (#120560)
Fixes https://github.com/pytorch/pytorch/issues/120431

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120560
Approved by: https://github.com/wconstab
2024-03-24 11:13:05 +00:00
lezcano
8a5a377190 Move doc links to point to main (#121823)
The previous links were pointing to an outdated branch

Command: `find . -type f -exec sed -i "s:docs/master:docs/main:g" {} +`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121823
Approved by: https://github.com/albanD, https://github.com/malfet
2024-03-15 19:49:37 +00:00
Ke Wen
038b2e8780 [c10d] Add complex support for P2P (#121240)
Fixes the following error when `tensor` is a complex tensor:
```
[rank0]:     return pg.send([tensor], dst, tag)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Unconvertible NCCL type ComplexFloat
```
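A hedged sketch of the now-working pattern (assumes a 2-rank NCCL process group, one GPU per rank):

```
import torch
import torch.distributed as dist

# Sketch: complex tensors can now be sent/received over NCCL P2P
# (presumably handled by viewing them as real tensors under the hood).
if dist.get_rank() == 0:
    z = torch.randn(4, dtype=torch.complex64, device="cuda:0")
    dist.send(z, dst=1)
else:
    buf = torch.empty(4, dtype=torch.complex64, device="cuda:1")
    dist.recv(buf, src=0)
```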

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121240
Approved by: https://github.com/shuqiangzhang
2024-03-08 22:47:49 +00:00
Shengbao Zheng
60aaba4128 create function to get ProcessGroupNCCL uid (#121132)
Summary: expose ProcessGroupNCCL uid

Differential Revision: D54446056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121132
Approved by: https://github.com/aaronenyeshi
2024-03-07 18:34:38 +00:00
Will Constable
f5b99976ad [C10D] Make _set_pg_timeout work with DeviceMesh PG (#120850)
Fixes #120847

Makes _set_pg_timeout work on the nccl and/or gloo backends instead of only one backend (gloo) in cases where both backends exist for the group.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120850
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2024-02-29 03:41:15 +00:00
Will Constable
c016ffed5b [C10D] Fix logic for default group=None in _set_pg_timeout (#120686)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120686
Approved by: https://github.com/yifuwang
2024-02-28 20:31:14 +00:00
Chien-Chin Huang
f422467ccb [BE]Delay the call to set_pytorch_distributed_envs_from_justknobs (#120625)
When the default process group is initialized twice, `init_process_group` shows an explicit error message indicating this.

However, with `set_pytorch_distributed_envs_from_justknobs` as the very first line of `init_process_group`, the error message becomes implicit and the root cause is hard to understand when testing with the FB code base.

Differential Revision: [D54206202](https://our.internmc.facebook.com/intern/diff/D54206202/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120625
Approved by: https://github.com/wconstab, https://github.com/yifuwang
2024-02-28 18:34:45 +00:00
Shengbao Zheng
440a9b212d [profiler] log process group config information in distributedInfo field (#119443)
Summary: Process group config is essential for analyzing collective patterns. We have added this to Execution Trace; now expose this information in Kineto as well.

Differential Revision: D53557965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119443
Approved by: https://github.com/kwen2501
2024-02-27 18:21:54 +00:00
Yifu Wang
11e4a9266d Temporarily support ranks + tag as pg identifier in native funcol (#120226)
As communicated in https://github.com/pytorch/pytorch/issues/93173#issuecomment-1907095208, although we are dropping `(ranks, tag)` as group identifier in funcols, there will be a grace period for migration. This PR adds temporary `(ranks, tag)` support in native funcols. It also helps us decouple the py funcol -> native funcol transition from the API change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120226
Approved by: https://github.com/wanchaol, https://github.com/wconstab
ghstack dependencies: #120042, #120043, #120070
2024-02-22 20:24:16 +00:00
Shengbao Zheng
9630bcbd49 [execution trace/chakra] remove backend_id from pg_info (#120038)
Summary:
PR #104373 (https://github.com/pytorch/pytorch/pull/104373) logged the backend via an unsafe dict lookup that might fail.
We decided to deprecate backend_id and use the pg id/name directly.

Differential Revision: D53676181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120038
Approved by: https://github.com/aaronenyeshi
2024-02-21 19:37:18 +00:00
Shuqiang Zhang
893dcac068 [c10d] explicitly abort communicators in destroy_process_group call (#119250)
Summary:
This PR tries to resolve issue #119215.

Basically, process group shutdown (and hence ncclCommAbort) is called in the
destroy_process_group APIs for the corresponding PGs. In the
destructor of ProcessGroup, we avoid calling abort/ncclCommAbort;
instead, it just checks whether the user has already explicitly called destroy_process_group. If
not, the destructor will log a warning and encourage/expect users to do so
to clean up PG resources.
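A minimal sketch of the cleanup pattern this change expects:

```
import torch.distributed as dist

# Sketch: call destroy_process_group explicitly so communicators are aborted
# deterministically; the ProcessGroup destructor now only warns if you forget.
dist.init_process_group("nccl")
try:
    ...  # training loop / collectives
finally:
    dist.destroy_process_group()
```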

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119250
Approved by: https://github.com/minsii, https://github.com/kwen2501
2024-02-12 18:40:28 +00:00
Alexander Grund
69344fe987 c10d: Don't add NCCL backend by default without CUDA (#119149)
The NCCL backend requires CUDA (including devices) to be available, so don't use that backend by default when CUDA isn't available, to avoid the following error when creating a CPU-only device mesh:
> RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

Fixes #117746
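A hedged sketch of the CPU-only flow this fixes (assumes a gloo-capable build and a torchrun launch):

```
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Sketch: on a CPU-only host, init_process_group() with no explicit backend
# should now resolve to gloo only, so a CPU device mesh can be created
# without a NCCL process group being instantiated.
dist.init_process_group()
mesh = init_device_mesh("cpu", (dist.get_world_size(),))
```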

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119149
Approved by: https://github.com/kwen2501
2024-02-05 23:55:07 +00:00
Xu Song
3d8c36786b Add device for distributed examples (#118867)
## 🐛 Describe the bug

The following example (`all_reduce`) is missing `device` allocation:
a205e7bf56/torch/distributed/distributed_c10d.py (L2080-L2087)

## Solution

A better example would look like this:
a205e7bf56/torch/distributed/distributed_c10d.py (L3212-L3222)
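Roughly, the improved style pins the tensor to the rank's device before the collective (hedged paraphrase, not a verbatim copy of the referenced lines):

```
import torch
import torch.distributed as dist

# Sketch: allocate the input on the rank's own device explicitly.
rank = dist.get_rank()
device = torch.device(f"cuda:{rank}")
tensor = torch.arange(2, dtype=torch.int64, device=device) + 1 + 2 * rank
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
```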

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118867
Approved by: https://github.com/soulitzer
2024-02-02 05:51:59 +00:00
Catherine Lee
4f5785b6b3 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Co-authored-by: Catherine Lee <csl@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 21:07:01 +00:00
Aaron Gokaslan
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod ctor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
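An illustration of the rewrite the rule performs (generic example, not taken from the PR):

```
# dict comprehension with a constant value -> dict.fromkeys
keys = ["a", "b", "c"]
before = {k: None for k in keys}
after = dict.fromkeys(keys)        # value defaults to None
assert before == after

shared = dict.fromkeys(keys, [])   # note: one list object shared by every key
```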

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
PyTorch MergeBot
40ece2e579 Revert "Enable possibly-undefined error code (#118533)"
This reverts commit 4f13f69a45.

Reverted https://github.com/pytorch/pytorch/pull/118533 on behalf of https://github.com/clee2000 due to sorry i'm trying to figure out a codev merge conflict, if this works i'll be back to rebase and merge ([comment](https://github.com/pytorch/pytorch/pull/118533#issuecomment-1917695185))
2024-01-30 19:00:34 +00:00
Wei (Will) Feng
644f64f2d1 [c10d] added docstrings and tests for src / dst (#118593)
Follow-up to https://github.com/pytorch/pytorch/pull/118359: whether ``src`` and ``dst`` are based on the global pg or a sub-pg
* update c10d docstrings: ``src`` / ``dst`` are based on the global pg regardless of the ``group`` argument (see the sketch after this list)
* communication ops with ``dst`` argument: ``reduce``, ``gather_object``, ``gather``, ``send``, ``isend``
* communication ops with ``src`` argument: ``irecv``, ``recv``, ``broadcast``, ``broadcast_object_list``, ``scatter``, ``scatter_object_list``
* ``pytest test/distributed/test_c10d_nccl.py -k subgroup``

3 collectives are for picklable objects (``gather_object``, ``broadcast_object_list``, ``scatter_object_list``). There are 2 ways to set the device:
* use the ``device`` argument: it's implemented in ``broadcast_object_list``; maybe worth implementing in the other 2
* ``torch.cuda.set_device(global_rank)``
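A hedged sketch of the first point (assumes a 4-rank NCCL job, one GPU per rank):

```
import torch
import torch.distributed as dist

# Sketch: dst/src are global ranks even when a subgroup is passed as `group`.
subgroup = dist.new_group(ranks=[2, 3])   # must be called by all ranks
if dist.get_rank() in (2, 3):
    t = torch.ones(1, device=f"cuda:{dist.get_rank()}")
    dist.reduce(t, dst=2, group=subgroup)  # dst=2 is the global rank, not 0
```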

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118593
Approved by: https://github.com/wconstab
2024-01-30 17:47:58 +00:00
Will Constable
da0635d17c Add pytorch-distributed justknobs helper (#118568)
Summary:
Sets up a helper that checks any JKs relevant to pytorch distributed,
and propagates their values to ENV.

Test Plan: Added unit test

Differential Revision: D53192406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118568
Approved by: https://github.com/zdevito
2024-01-30 08:13:52 +00:00
Edward Z. Yang
4f13f69a45 Enable possibly-undefined error code (#118533)
Fixes https://github.com/pytorch/pytorch/issues/118129

Suppressions automatically added with

```
import re

with open("error_file.txt", "r") as f:
    errors = f.readlines()

error_lines = {}
for error in errors:
    match = re.match(r"(.*):(\d+):\d+: error:.*\[(.*)\]", error)
    if match:
        file_path, line_number, error_type = match.groups()
        if file_path not in error_lines:
            error_lines[file_path] = {}
        error_lines[file_path][int(line_number)] = error_type

for file_path, lines in error_lines.items():
    with open(file_path, "r") as f:
        code = f.readlines()
    for line_number, error_type in sorted(lines.items(), key=lambda x: x[0], reverse=True):
        code[line_number - 1] = code[line_number - 1].rstrip() + f"  # type: ignore[{error_type}]\n"
    with open(file_path, "w") as f:
        f.writelines(code)
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118533
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-01-30 05:08:10 +00:00
Yifu Wang
b778f44e97 Allow using native c10d_functional via _functional_collectives (#113057)
This diff introduces an env var `_USE_NATIVE_C10D_FUNCTIONAL` that tells `_functional_collective` to use native `c10d_functional` ops. The Python version and the native version will co-exist until we completely switch to the native version after more testing and verification.

NOTE: `DeviceMesh` support for native `c10d_functional` will be added in a subsequent PR.
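A hedged opt-in sketch (assumes the env var must be set before `_functional_collectives` is imported):

```
import os

# Sketch: opt in to the native c10d_functional ops during the migration window.
os.environ["_USE_NATIVE_C10D_FUNCTIONAL"] = "1"

import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

dist.init_process_group("nccl")
out = funcol.all_reduce(torch.ones(4, device="cuda"), "sum", group=dist.group.WORLD)
```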

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113057
Approved by: https://github.com/LucasLLC, https://github.com/wconstab, https://github.com/wanchaol
2024-01-30 02:34:25 +00:00
PyTorch MergeBot
bb55970e5b Revert "Add justknobs env helper for pytorch distributed (#118451)"
This reverts commit 4d1bb2175a.

Reverted https://github.com/pytorch/pytorch/pull/118451 on behalf of https://github.com/wconstab due to Broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/118451#issuecomment-1915369013))
2024-01-29 19:01:05 +00:00
Will Constable
4d1bb2175a Add justknobs env helper for pytorch distributed (#118451)
Summary:
Adds a JK killswitch check and configures the env for enabling pytorch
nccl flight recorder. Note: this only enables recording events in memory, not
dumping them.

Test Plan: CI test

Reviewed By: zdevito

Differential Revision: D52920092

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118451
Approved by: https://github.com/malfet
2024-01-29 08:57:16 +00:00
Edward Z. Yang
46712b019d Enable local_partial_types (#118467)
When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467
Approved by: https://github.com/Skylion007
ghstack dependencies: #118414, #118418, #118432
2024-01-28 13:38:22 +00:00
Will Constable
70699a6357 [C10D] Add tests for gather and gather_object with subgroup (#118359)
Addresses #118337 somewhat - we probably need to update docs. Let's first
confirm what behavior we want.

Identifies a couple of confusing things
1) 'dst' arg for many collectives is always in 'global' rank regardless
   of whether a subgroup is passed in.  This needs a doc update
2) gather_object has a strong dependency on setting the cuda device;
   could we make that smoother?
3) gather_object also should be happy with an empty list on the dst
   side, imo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118359
Approved by: https://github.com/weifengpy
2024-01-27 04:08:56 +00:00
PyTorch MergeBot
3d062f9abe Revert "[pytorch][kineto] log process group config in distributed info (#117774)"
This reverts commit 9c1348feb3.

Reverted https://github.com/pytorch/pytorch/pull/117774 on behalf of https://github.com/aaronenyeshi due to This diff is breaking internal jobs, but has been internally reverted ([comment](https://github.com/pytorch/pytorch/pull/117774#issuecomment-1911251092))
2024-01-26 01:10:31 +00:00
Edward Z. Yang
3e76a0e9c2 Install an excepthook which annotates exceptions with rank information when distributed is initialized (#118190)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118190
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2024-01-25 20:43:18 +00:00
Shengbao Zheng
9c1348feb3 [pytorch][kineto] log process group config in distributed info (#117774)
Summary: Process group config is essential for analyzing collective patterns. We have added this to Execution Trace; now expose this information in Kineto as well.

Test Plan: Tested in HPC

Differential Revision: D52882292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117774
Approved by: https://github.com/wconstab, https://github.com/aaronenyeshi
2024-01-25 00:08:10 +00:00
Ke Wen
1e185c7803 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-24 18:42:14 +00:00
PyTorch MergeBot
b5799d9977 Revert "[c10d] Barrier uses stream sync instead of device sync (#117804)"
This reverts commit 0f6bbb1c07.

Reverted https://github.com/pytorch/pytorch/pull/117804 on behalf of https://github.com/clee2000 due to sorry the docs test failure is real, I think it wants the lines after the .. note to be indented https://github.com/pytorch/pytorch/actions/runs/7616827874/job/20745016788.  Marking as nosignal due to bad Dr. CI categorization ([comment](https://github.com/pytorch/pytorch/pull/117804#issuecomment-1904882487))
2024-01-22 21:54:03 +00:00
Ke Wen
0f6bbb1c07 [c10d] Barrier uses stream sync instead of device sync (#117804)
Resubmitting #96785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117804
Approved by: https://github.com/wconstab
2024-01-22 20:14:51 +00:00
Ke Wen
6d96beb6be [c10d] Remove health check (#117699)
https://github.com/pytorch/pytorch/pull/114916 and https://github.com/pytorch/pytorch/pull/116222 added support for eager NCCL comm init (performed as soon as `init_process_group` is called).

If any users care about the time difference and want to see NCCL init errors early, they can use eager init now.
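A minimal sketch of the eager-init path (assumes a torchrun launch that sets LOCAL_RANK):

```
import os
import torch
import torch.distributed as dist

# Sketch: passing device_id (from #114916/#116222) opts into eager NCCL comm
# init, so NCCL setup errors surface at init_process_group time instead of at
# the first collective.
local_rank = int(os.environ["LOCAL_RANK"])
dist.init_process_group("nccl", device_id=torch.device("cuda", local_rank))
```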

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117699
Approved by: https://github.com/wconstab
2024-01-18 02:14:49 +00:00
FFFrog
7b0926cc3e Fix wrong class inheritance in pyi (#116404)
As the title states.

f6dfbffb3b/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L153)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116404
Approved by: https://github.com/ezyang, https://github.com/wconstab
2024-01-12 21:25:29 +00:00
Chip Turner
9693b3740b [easy] [c10d] Add documentation for the device_id parameter for init_process_group (#116222)
Follow-up to add missing docs for #114916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116222
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2024-01-03 19:32:18 +00:00
Aaron Gokaslan
bbe3261dd3 [BE]: Use iterable.chain.from_iterable where possible (#116376)
This is more readable and more efficient when dealing with lots of sequences to chain together.
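A small illustration of the pattern (generic example, not taken from the PR):

```
import itertools

# Flattening many sequences lazily, without building intermediate lists.
nested = [[1, 2], [3], [4, 5, 6]]
flat = list(itertools.chain.from_iterable(nested))  # [1, 2, 3, 4, 5, 6]
```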

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116376
Approved by: https://github.com/albanD
2023-12-27 19:20:07 +00:00
fduwjj
f6dfbffb3b [c10d] Add hashing as a debug feature for before and after NCCL collective call (#113238)
For now, we use `TORCH_DISTRIBUTED_DEBUG=DETAIL` to turn on a debug feature which calculates a hash of the input tensors and output results of c10d collectives in NCCL. This is a debugging feature so that we can rule out bugs at the c10d level.

<img width="840" alt="image" src="https://github.com/pytorch/pytorch/assets/6937752/cdc70b0b-ae3c-4efd-86ff-adc5c5ba505f">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113238
Approved by: https://github.com/wconstab, https://github.com/fegin
2023-12-25 22:25:38 +00:00
Lucas Pasqualin
8452f41305 Adds allreduce to inductor remap (#115950)
Fixes #115728

Implements a rewrite path for allreduce

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115950
Approved by: https://github.com/wconstab
2023-12-18 22:00:22 +00:00