pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Xuehai Pan	267f82b860	[BE] Format `.ci/` / `.github/` / `benchmarks/` / `functorch/` / `tools/` / `torchgen/` with `ruff format` (#132577 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132577 Approved by: https://github.com/malfet	2024-10-11 18:30:26 +00:00
Xuehai Pan	c0ed38e644	[BE][Easy][3/19] enforce style for empty lines in import segments in `benchmarks/` (#129754 ) See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter. You can review these PRs via: ```bash git diff --ignore-all-space --ignore-blank-lines HEAD~1 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754 Approved by: https://github.com/ezyang	2024-07-17 14:34:42 +00:00
Aaron Gokaslan	6c2a8b6b38	[Ez][BE]: Enable new stable ruff rules (#129825 ) Applies a bunch of new ruff lint rules that are now stable. Some of these improve efficiency or readability. Since I already did passes on the codebase for these when they were in preview, there should be relatively few changes to the codebase. This is just more for future hardening of it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/129825 Approved by: https://github.com/XuehaiPan, https://github.com/jansel, https://github.com/malfet	2024-07-02 14:47:10 +00:00
Yifu Wang	bbd47f7b2f	Remove ProcessGroupCudaP2P and change async-TP to use SymmetricMemory (#128762 ) This PR removes `ProcessGroupCudaP2P` and changes async-TP to use `SymmetricMemory`. The async-TP implementation is still workspace-based, but it now doesn't require a buffer size to be specified upfront. Pull Request resolved: https://github.com/pytorch/pytorch/pull/128762 Approved by: https://github.com/wanchaol	2024-06-25 22:32:21 +00:00
Ke Wen	01601ebd41	Retire torch.distributed.pipeline (#127354 ) Actually retiring module after deprecation warning for a while. The new supported module is: torch.distributed.pipelining. Please migrate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354 Approved by: https://github.com/wconstab	2024-06-07 08:11:58 +00:00
PyTorch MergeBot	0ff60236ab	Revert "Retire torch.distributed.pipeline (#127354 )" This reverts commit `b9c058c203`. Reverted https://github.com/pytorch/pytorch/pull/127354 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the doc build failure looks legit `b9c058c203` ([comment](https://github.com/pytorch/pytorch/pull/127354#issuecomment-2148133982))	2024-06-04 18:19:31 +00:00
Ke Wen	b9c058c203	Retire torch.distributed.pipeline (#127354 ) Actually retiring module after deprecation warning for a while. The new supported module is: torch.distributed.pipelining. Please migrate. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354 Approved by: https://github.com/wconstab	2024-06-04 07:03:26 +00:00
Xuehai Pan	26f4f10ac8	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980	2024-05-27 14:49:57 +00:00
PyTorch MergeBot	55c0ab2887	Revert "[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126)" This reverts commit `7763c83af6`. Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))	2024-05-27 09:22:08 +00:00
Xuehai Pan	7763c83af6	[5/N][Easy] fix typo for `usort` config in `pyproject.toml` (`kown` -> `known`): sort torch (#127126) The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126 Approved by: https://github.com/kit1980 ghstack dependencies: #127122, #127123, #127124, #127125	2024-05-27 04:22:18 +00:00
Yifu Wang	4a09117d16	Introduce ProcessGroupCudaP2P (#122163 ) ## Context This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers. The stack contains several components: - `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining. - `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops. - Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops. To enable the prototype feature: - Set the distributed backend to `cuda_p2p`. - Set `torch._inductor.config._micro_pipeline_tp` to `True`. NOTE: the prototype sets nothing in stone w.r.t to each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved. ## Benchmark Setup: - 8 x H100 (500W) + 3rd gen NVSwitch. - Llama3 8B training w/ torchtitan. - 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose. 
Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0 <img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1"> Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn <img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2"> ## This PR `ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA. `ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it. Usage: ``` # Using ProcessGroupCudaP2P dist.init_process_group(backend="cuda_p2p", ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options pg_options = ProcessGroupCudaP2P.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options pg_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) 
# Using ProcessGroupCudaP2P while specifying both # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options pg_options = ProcessGroupCudaP2P.Options() pg_options.nccl_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Down-casting the backend to access p2p buffers for cuda_p2p specific # optimizations if is_cuda_p2p_group(group): backend = get_cuda_p2p_backend(group) if required_p2p_buffer_size > backend.get_buffer_size(): # fallback p2p_buffer = backend.get_p2p_buffer(...) else: # fallback ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163 Approved by: https://github.com/wanchaol	2024-05-24 18:33:18 +00:00
PyTorch MergeBot	1b29c16e5e	Revert "Introduce ProcessGroupCudaP2P (#122163 )" This reverts commit `2dd2699860`. Reverted https://github.com/pytorch/pytorch/pull/122163 on behalf of https://github.com/jithunnair-amd due to This is breaking ROCm distributed CI on trunk ([comment](https://github.com/pytorch/pytorch/pull/122163#issuecomment-2127518473))	2024-05-23 16:06:14 +00:00
Yifu Wang	2dd2699860	Introduce ProcessGroupCudaP2P (#122163 ) ## Context This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers. The stack contains several components: - `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining. - `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops. - Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops. To enable the prototype feature: - Set the distributed backend to `cuda_p2p`. - Set `torch._inductor.config._micro_pipeline_tp` to `True`. NOTE: the prototype sets nothing in stone w.r.t to each component's design. The purpose is to have a performant baseline with reasonable design on which each component can be further improved. ## Benchmark Setup: - 8 x H100 (500W) + 3rd gen NVSwitch. - Llama3 8B training w/ torchtitan. - 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose. 
Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0 <img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1"> Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn <img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2"> ## This PR `ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA. `ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it. Usage: ``` # Using ProcessGroupCudaP2P dist.init_process_group(backend="cuda_p2p", ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options pg_options = ProcessGroupCudaP2P.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options pg_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) 
# Using ProcessGroupCudaP2P while specifying both # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options pg_options = ProcessGroupCudaP2P.Options() pg_options.nccl_options = ProcessGroupNCCL.Options() dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...) # Down-casting the backend to access p2p buffers for cuda_p2p specific # optimizations if is_cuda_p2p_group(group): backend = get_cuda_p2p_backend(group) if required_p2p_buffer_size > backend.get_buffer_size(): # fallback p2p_buffer = backend.get_p2p_buffer(...) else: # fallback ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163 Approved by: https://github.com/wanchaol	2024-05-22 09:33:05 +00:00
Aaron Gokaslan	5a1216bb2e	[BE]: Update ruff to 0.4.1 (#124549) Update ruff to 0.4.1. This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes. Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0 \| Repository \| Linter (v0.3) \| Linter (v0.4) \| Formatter (v0.3) \| Formatter (v0.4) \| \|----------------------------------------------------\|---------------\|---------------\|------------------\|------------------\| \| [pytorch/pytorch](https://github.com/pytorch/pytorch) \| 328.7 \| 251.8 \| 351.1 \| 274.9 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549 Approved by: https://github.com/ezyang	2024-04-21 14:06:23 +00:00
Yifu Wang	c58b0ac7c2	IntraNodeComm primitives for allgather_matmul (#118038 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/118038 Approved by: https://github.com/wanchaol	2024-04-04 00:46:08 +00:00
Aaron Gokaslan	6de28e92d2	[BE]: Apply FURB118 (prev): replaces unnecessary lambdas with operator. (#116027 ) This replaces a bunch of unnecessary lambdas with the operator package. This is semantically equivalent, but the operator package is faster, and arguably more readable. When the FURB rules are taken out of preview, I will enable it as a ruff check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116027 Approved by: https://github.com/malfet	2023-12-20 19:35:08 +00:00
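The lambda-to-`operator` rewrite described in the entry above can be sketched as follows. This is only an illustration of the general pattern; the sample data and variable names are made up and do not come from the PR.

```python
import operator
from functools import reduce

# Hypothetical sample data, for illustration only.
pairs = [("b", 2), ("a", 3), ("c", 1)]
numbers = [1, 2, 3, 4]

# Before: lambdas that merely re-implement an existing operator.
sorted_with_lambda = sorted(pairs, key=lambda p: p[1])
product_with_lambda = reduce(lambda x, y: x * y, numbers)

# After: the equivalent callables from the operator module.
sorted_with_operator = sorted(pairs, key=operator.itemgetter(1))
product_with_operator = reduce(operator.mul, numbers)
```

Both forms produce the same results; the `operator` versions avoid defining a throwaway function for something the standard library already provides.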
Edward Z. Yang	dd3a77bc96	Apply UFMT to all files in benchmarks/ (#105928 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/105928 Approved by: https://github.com/albanD	2023-07-26 01:18:48 +00:00
Justin Chu	5ef023b05a	[BE] Enable ruff's UP rules and autoformat benchmarks/ (#105429 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/105429 Approved by: https://github.com/malfet	2023-07-19 04:46:37 +00:00
Aaron Gokaslan	597b558c51	[BE]: Update flake8 and plugins and fix bugs (#97795 ) Update flake8 and flake8-plugins in lintrunner to a modern version. Enables more checks and makes flake8 checks significantly faster. Added a few additional rule ignores that will need to be fixed in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97795 Approved by: https://github.com/alexsio27444, https://github.com/janeyx99, https://github.com/ezyang	2023-03-28 23:51:55 +00:00
Aaron Gokaslan	67d9790985	[BE] Apply almost all remaining flake8-comprehension checks (#94676) Applies the remaining flake8-comprehension fixes and checks. This change replaces all remaining unnecessary generator expressions with list/dict/set comprehensions, which are more succinct, performant, and better supported by our torch.jit compiler. It also removes useless generators such as `set(a for a in b)`, resolving it into just the set call. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94676 Approved by: https://github.com/ezyang	2023-02-12 01:01:25 +00:00
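As a rough illustration of the kind of rewrite this entry describes (the values and names below are invented for the example, not taken from the PR):

```python
# Hypothetical input data, for illustration only.
b = [3, 1, 3, 2]
items = [("x", 1), ("y", 2)]

# Before: generator expressions wrapped in a constructor call.
doubled_before = list(v * 2 for v in b)
lookup_before = dict((k, v) for k, v in items)
unique_before = set(a for a in b)

# After: comprehensions, and a plain set() call where the generator
# only passed each element through unchanged.
doubled_after = [v * 2 for v in b]
lookup_after = {k: v for k, v in items}
unique_after = set(b)
```

The "after" forms build the same containers without the intermediate generator object.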
Xuehai Pan	8d45f555d7	[BE] [1/3] Rewrite `super()` calls in caffe2 and benchmarks (#94587 ) Rewrite Python built-in class `super()` calls. Only non-semantic changes should be applied. - #94587 - #94588 - #94592 Also, methods with only a `super()` call are removed: ```diff class MyModule(nn.Module): - def __init__(self): - super().__init__() - def forward(self, ...): ... ``` Some cases that change the semantics should be kept unchanged. E.g.: `f152a79be9/caffe2/python/net_printer.py (L184-L190)` `f152a79be9/test/test_jit_fuser_te.py (L2628-L2635)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/94587 Approved by: https://github.com/ezyang	2023-02-11 18:19:48 +00:00
Xuehai Pan	a229b4526f	[BE] Prefer dash over underscore in command-line options (#94505) Preferring dash over underscore in command-line options. Add `--command-arg-name` to the argument parser. The old arguments with underscores `--command_arg_name` are kept for backward compatibility. Both dashes and underscores are used in the PyTorch codebase. Some argument parsers only have dashes or only have underscores in arguments. For example, the `torchrun` utility for distributed training only accepts underscore arguments (e.g., `--master_port`). Dashes are more common in other command-line tools, and they appear to be the default choice in the Python standard library: `argparse.BooleanOptionalAction`: `4a9dff0e5a/Lib/argparse.py (L893-L895)` ```python class BooleanOptionalAction(Action): def __init__(...): if option_string.startswith('--'): option_string = '--no-' + option_string[2:] _option_strings.append(option_string) ``` It adds `--no-argname`, not `--no_argname`. Also, typing `_` requires pressing the shift or caps-lock key, whereas `-` does not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94505 Approved by: https://github.com/ezyang, https://github.com/seemethere	2023-02-09 20:16:49 +00:00
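One way to accept both spellings with `argparse` is to register them as aliases of a single option. The sketch below uses the placeholder option name from the entry above; the `type` and `default` values are assumptions for illustration, and the actual PR may wire this up differently.

```python
import argparse

parser = argparse.ArgumentParser()
# The dashed spelling is listed first so it is the one shown in --help;
# the underscore spelling stays available for backward compatibility.
parser.add_argument(
    "--command-arg-name",
    "--command_arg_name",
    dest="command_arg_name",
    type=int,       # assumed type, for illustration only
    default=0,      # assumed default, for illustration only
)

new_style = parser.parse_args(["--command-arg-name", "23456"])
old_style = parser.parse_args(["--command_arg_name", "23456"])
```

Both invocations populate the same `command_arg_name` attribute, so downstream code does not need to know which spelling the user typed.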
Aaron Gokaslan	8fce9a09cd	[BE]: pyupgrade Python to 3.8 - imports and object inheritance only (#94308 ) Apply parts of pyupgrade to torch (starting with the safest changes). This PR only does two things: removes the need to inherit from object and removes unused future imports. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94308 Approved by: https://github.com/ezyang, https://github.com/albanD	2023-02-07 21:10:56 +00:00
Yanli Zhao	2004df9097	Remove python ddp (#91663 ) As it is not used by anyone and also it is not maintained by PyTorch Pull Request resolved: https://github.com/pytorch/pytorch/pull/91663 Approved by: https://github.com/rohan-varma	2023-01-04 05:22:30 +00:00
Sergii Dymchenko	30edd39bdc	Fix non-existing parameters in docstrings in benchmarks (#91115 ) This is a continuation of https://github.com/pytorch/pytorch/pull/90505 Pull Request resolved: https://github.com/pytorch/pytorch/pull/91115 Approved by: https://github.com/clee2000	2022-12-20 02:07:32 +00:00
Kazuaki Ishizaki	14d5f139d2	Fix typos under benchmarks, test, and tools directories (#87975 ) This PR fixes typos in `.md` files under benchmarks, test, and tools directories Pull Request resolved: https://github.com/pytorch/pytorch/pull/87975 Approved by: https://github.com/kit1980	2022-10-29 01:26:17 +00:00
Sergii Dymchenko	591222f5d9	Fix use-dict-literal lint (#83718 ) Fix use-dict-literal pylint suggestions by changing `dict()` to `{}`. This PR should do the change for every Python file except test/jit/test_list_dict.py, where I think the intent is to test the constructor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/83718 Approved by: https://github.com/albanD	2022-08-24 00:26:46 +00:00
Yulv-git	ac2d2e3a3d	Fix some typos. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/75561 Approved by: https://github.com/albanD	2022-04-11 21:55:59 +00:00
Rodrigo Berriel	a0dea074b2	Remove `.data` from benchmarks and tensorboard (#65389 ) Summary: Related to https://github.com/pytorch/pytorch/issues/30987 and https://github.com/pytorch/pytorch/issues/33628. Fix the following tasks: - Remove the use of `.data` in all our internal code: - [x] `benchmarks/` - [x] `torch/utils/tensorboard/` cc pietern mrshenli pritamdamania87 zhaojuanmao satgera rohan-varma gqchen aazzolini osalpekar jiayisuse SciPioneer H-Huang gcramer23 albanD gchanan Pull Request resolved: https://github.com/pytorch/pytorch/pull/65389 Reviewed By: soulitzer Differential Revision: D31093464 Pulled By: albanD fbshipit-source-id: 3a9c8834fd544a59a1cc2b930ae538fd1d46b232	2021-09-22 11:16:59 -07:00
Sean Lawlor	34c9f5a8da	[DDP Communication Hook] Update get_tensor and set_tensor to be cleaner naming conventions (buffer() and set_buffer()) (#62662 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62662 Replaced the methods set_tensor(.) and get_tensor() in the python exposed API from the C++ logic with buffer() and set_buffer(.) to be a cleaner interface. Reviewed By: SciPioneer Differential Revision: D30012869 fbshipit-source-id: bd8efab583dd89c96f9aeb3dd48a12073f0b1482	2021-08-04 09:27:31 -07:00
Bo Wang	e098e9000b	Compare DDP static graph (C++ core) with legacy DDP forward and backward delay. (#61507) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61507 Benchmark Python-only DDP vs production C++ based DistributedDataParallel. - Implemented a pure python DDP: PythonDDP with support of SYNC and ASYNC reduction - Added compare_ddp to measure the difference in forward and backward step Kudos to Shen and Yi for the great idea. Test Plan: Test on DevGPUS with 2 CUDA devices. $python compare_ddp.py Python only DDP has slightly better (-1%) forward performance and slightly slower (2%-20%) backward performance. This suggests that we need to keep C++ Core since the maximum latency increase can be 20%. See README.md for details. Imported from OSS Differential Revision: D29685364 Reviewed By: mrshenli Pulled By: bowangbj fbshipit-source-id: 429e4473fac0ec4c70d6db12d946d2636dd6477a	2021-07-15 12:52:22 -07:00
Garrett Cramer	5a5c7f563d	add trainer hook functions (#60785 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60785 This pr adds hook functions for the trainers. Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D29697299 Pulled By: gcramer23 fbshipit-source-id: cc3b991aad0d32503fbfc5acd4fca8b404e74c0f	2021-07-14 13:19:17 -07:00
Garrett Cramer	304c02ee44	refactor ps benchmark (#60784 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60784 This pr refactors the ps benchmark for modular trainers. Test Plan: Imported from OSS Reviewed By: zou3519 Differential Revision: D29697291 Pulled By: gcramer23 fbshipit-source-id: 64579a1f5326d3cd9f32936dcf53bc243d54b71d	2021-07-14 13:19:13 -07:00
Basil Hosmer	cab926b2c0	faster generate_square_subsequent_mask in nn.Transformer (#60631 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60631 Per #48360, speed up `Transformer.generate_square_subsequent_mask`. New impl is informally ~5x faster, though absolute difference is probably small. PR includes Python and C++ versions as well as a couple of places where the previous impl had been copied around. Test Plan: Imported from OSS Reviewed By: jbschlosser, albanD Differential Revision: D29356673 Pulled By: bhosmer fbshipit-source-id: 4c062ba0ead61a445aeef451c78777bf0b3a631e	2021-06-25 16:07:01 -07:00
Garrett Cramer	4ed2d5d9bb	ps sparse rpc (#58003 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58003 adds trainer class DdpTrainer adds trainer class DdpSparseRpcTrainer adds server class ParameterServerBase adds server class AverageParameterServer adds experiment ddp_cpu_sparse_rpc_nccl_allreduce adds experiment ddp_cuda_sparse_rpc_nccl_allreduce quip document https://fb.quip.com/iQUtAeKIxWpF Test Plan: Imported from OSS Reviewed By: albanD Differential Revision: D29379696 Pulled By: gcramer23 fbshipit-source-id: 9cf5fb7398ba2fa3eb694afbddc4ed00d97f205f	2021-06-24 17:21:49 -07:00
Zachary Kneupper	b8d56572a1	Open json config file in context manager (#58077) Summary: * Open json config file safely using a context manager (using a with block). * This will make sure that the file is closed even if an exception is raised. Pull Request resolved: https://github.com/pytorch/pytorch/pull/58077 Reviewed By: anjali411 Differential Revision: D28711177 Pulled By: H-Huang fbshipit-source-id: 597ba578311b1f1d6706e487872db4e784c78c3c	2021-05-26 08:58:40 -07:00
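The pattern this entry describes can be sketched as below. The helper name `load_config`, the file name, and the config contents are hypothetical; the example writes a temporary file only so that it is self-contained.

```python
import json
import os
import tempfile


def load_config(path):
    """Read a JSON config; the with block closes the file even on errors."""
    with open(path) as f:
        return json.load(f)


# Create a throwaway config file so the sketch can run on its own.
with tempfile.TemporaryDirectory() as tmp_dir:
    config_path = os.path.join(tmp_dir, "config.json")  # hypothetical name
    with open(config_path, "w") as f:
        json.dump({"world_size": 2}, f)  # hypothetical contents
    config = load_config(config_path)
```

Compared with a bare `json.load(open(path))`, the `with` block guarantees the file handle is released when parsing raises, rather than whenever the object happens to be garbage-collected.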
Horace He	79a258f448	s/foward/forward/g (#58497 ) Summary: Annoying typo. Prompted by these profiling results: https://github.com/pytorch/pytorch/issues/56419#issuecomment-825787828 Pull Request resolved: https://github.com/pytorch/pytorch/pull/58497 Reviewed By: malfet Differential Revision: D28521081 Pulled By: Chillee fbshipit-source-id: ab91a2e167dd7d3387fd56106a6cff81f7a32f10	2021-05-19 11:42:42 -07:00
Garrett Cramer	16d617c3e5	test experiment script (#57925 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57925 1. adds test_scripts.py that will run added scripts and verify that there are no errors 2. adds local ddp_nccl_allreduce experiment script test with command `pytest test_scripts.py` Test Plan: Imported from OSS Reviewed By: agolynski Differential Revision: D28382452 Pulled By: gcramer23 fbshipit-source-id: 21028a990ebfedf1aad6b007a723c02403e8bea8	2021-05-12 10:22:47 -07:00
Garrett Cramer	bc2540f0be	benchmark rpc ps (#57454 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/57454 DDP with NCCL AllReduce for the entire model experiment from Quip https://fb.quip.com/iQUtAeKIxWpF I have been testing this on the AI cluster. There seem to be some connection problems with RPC when using multiple trainers or parameter servers. ``` Namespace(bconfig_id='3', dconfig_id='DummyData', mconfig_id='DummyModel', pconfig_id='None', tconfig_id='DdpNcclTrainer') benchmark warmup done metrics for trainer=0 +-----------------------------------+----------+---------+----------+------------+-----------+ \| name \| min \| max \| mean \| variance \| stdev \| +===================================+==========+=========+==========+============+===========+ \| backward_metric,backward \| 2.45248 \| 4.18304 \| 3.972 \| 0.097122 \| 0.311644 \| +-----------------------------------+----------+---------+----------+------------+-----------+ \| batch_level_metric,batch_all \| 4.11955 \| 4.58138 \| 4.31439 \| 0.00229848 \| 0.0479424 \| +-----------------------------------+----------+---------+----------+------------+-----------+ \| foward_metric,forward_pass \| 0.141312 \| 1.4807 \| 0.222566 \| 0.0555432 \| 0.235676 \| +-----------------------------------+----------+---------+----------+------------+-----------+ \| hook_future_metric,nccl_allreduce \| 0.191488 \| 3.54099 \| 3.11694 \| 0.557106 \| 0.746395 \| +-----------------------------------+----------+---------+----------+------------+-----------+ metrics for trainer=1 +-----------------------------------+----------+---------+----------+-------------+------------+ \| name \| min \| max \| mean \| variance \| stdev \| +===================================+==========+=========+==========+=============+============+ \| backward_metric,backward \| 2.4617 \| 2.59174 \| 2.51196 \| 0.000938276 \| 0.0306313 \| +-----------------------------------+----------+---------+----------+-------------+------------+ 
\| batch_level_metric,batch_all \| 4.22605 \| 4.71757 \| 4.27921 \| 0.00468424 \| 0.0684415 \| +-----------------------------------+----------+---------+----------+-------------+------------+ \| foward_metric,forward_pass \| 0.807936 \| 1.50118 \| 0.846008 \| 0.00601693 \| 0.0775688 \| +-----------------------------------+----------+---------+----------+-------------+------------+ \| hook_future_metric,nccl_allreduce \| 0.108544 \| 0.1536 \| 0.11222 \| 2.16726e-05 \| 0.00465538 \| +-----------------------------------+----------+---------+----------+-------------+------------+ metrics for all trainer +-----------------------------------+----------+---------+----------+------------+-----------+ \| name \| min \| max \| mean \| variance \| stdev \| +===================================+==========+=========+==========+============+===========+ \| backward_metric,backward \| 2.45248 \| 4.18304 \| 3.24198 \| 0.584391 \| 0.764455 \| +-----------------------------------+----------+---------+----------+------------+-----------+ \| batch_level_metric,batch_all \| 4.11955 \| 4.71757 \| 4.2968 \| 0.00378467 \| 0.0615197 \| +-----------------------------------+----------+---------+----------+------------+-----------+ \| foward_metric,forward_pass \| 0.141312 \| 1.50118 \| 0.534287 \| 0.128284 \| 0.358167 \| +-----------------------------------+----------+---------+----------+------------+-----------+ \| hook_future_metric,nccl_allreduce \| 0.108544 \| 3.54099 \| 1.61458 \| 2.5456 \| 1.59549 \| +-----------------------------------+----------+---------+----------+------------+-----------+ ``` Test Plan: Imported from OSS Reviewed By: H-Huang, ngimel Differential Revision: D28296175 Pulled By: gcramer23 fbshipit-source-id: 5dd208fc86f8b5558d7c8860d685bb25c2e09fe7	2021-05-07 19:58:40 -07:00
Sam Estep	75024e228c	Add lint for unqualified `type: ignore` (#56290 ) Summary: The other half of https://github.com/pytorch/pytorch/issues/56272. Pull Request resolved: https://github.com/pytorch/pytorch/pull/56290 Test Plan: CI should pass on the tip of this PR, and we know that the lint works because the following CI runs (before this PR was finished) failed: - https://github.com/pytorch/pytorch/runs/2384511062 - https://github.com/pytorch/pytorch/actions/runs/765036024 Reviewed By: seemethere Differential Revision: D27867219 Pulled By: samestep fbshipit-source-id: e648f07b6822867e70833e23ddafe7fb7eaca235	2021-04-21 08:07:23 -07:00
Sam Estep	8c798e0622	Forbid trailing whitespace (#53406 ) Summary: Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857 These are the only hand-written parts of this diff: - the addition to `.github/workflows/lint.yml` - the file endings changed in these four files (to appease FB-internal land-blocking lints): - `GLOSSARY.md` - `aten/src/ATen/core/op_registration/README.md` - `scripts/README.md` - `torch/csrc/jit/codegen/fuser/README.md` The rest was generated by running this command (on macOS): ``` git grep -I -l ' $' -- . ':(exclude)/contrib/' ':(exclude)third_party' \| xargs gsed -i 's/ *$//' ``` I looked over the auto-generated changes and didn't see anything that looked problematic. Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406 Test Plan: This run (after adding the lint but before removing existing trailing spaces) failed: - https://github.com/pytorch/pytorch/runs/2043032377 This run (on the tip of this PR) succeeded: - https://github.com/pytorch/pytorch/runs/2043296348 Reviewed By: walterddr, seemethere Differential Revision: D26856620 Pulled By: samestep fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97	2021-03-05 17:22:55 -08:00
Rohan Varma	5021582fe6	Fix benchmarks/distributed/ddp/benchmark.py (#51095 ) Summary: Fixes the issue reported in https://github.com/pytorch/pytorch/issues/50679 by using built-in object-based collectives. User has verified this patch works Test with: RANK=0 python3 pytorch-dist-benchmark.py --world-size 2 --master-addr 127.0.0.1 --master-port 23456 RANK=1 python3 pytorch-dist-benchmark.py --world-size 2 --master-addr 127.0.0.1 --master-port 23456 Pull Request resolved: https://github.com/pytorch/pytorch/pull/51095 Reviewed By: SciPioneer Differential Revision: D26070275 Pulled By: rohan-varma fbshipit-source-id: 59abcaac9e395bcdd8a018bf6ba07521d94b2fdf	2021-01-29 11:10:13 -08:00
Pritam Damania	96cedefd8e	[Pipe] Refactor convert_to_balance under non-test package. (#50860 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50860 Since fairscale.nn.Pipe still uses 'balance' and 'devices' parameters, other frameworks like fairseq still use these parameters. As a result, the `convert_to_balance` method is a nice utility to use for migrating to PyTorch Pipe without changing a lot of code in other frameworks. In addition to this I've renamed the method to be more illustrative of what it does and also allowed an optional devices parameter. ghstack-source-id: 120430775 Test Plan: 1) waitforbuildbot 2) Tested with fairseq Reviewed By: SciPioneer Differential Revision: D25987273 fbshipit-source-id: dccd42cf1a74b08c876090d3a10a94911cc46dd8	2021-01-28 12:10:21 -08:00
Oscar Sandoval	09f4844c1f	Pytorch Distributed RPC Reinforcement Learning Benchmark (Throughput and Latency) (#46901 ) Summary: A Pytorch Distributed RPC benchmark measuring Agent and Observer Throughput and Latency for Reinforcement Learning Pull Request resolved: https://github.com/pytorch/pytorch/pull/46901 Reviewed By: mrshenli Differential Revision: D25869514 Pulled By: osandoval-fb fbshipit-source-id: c3b36b21541d227aafd506eaa8f4e5f10da77c78	2021-01-11 19:02:36 -08:00
skyline75489	46b83212d1	Remove unused six code for Python 2/3 compatibility (#48077 ) Summary: This is basically a reborn version of https://github.com/pytorch/pytorch/issues/45254 . Ref: https://github.com/pytorch/pytorch/issues/42919 Pull Request resolved: https://github.com/pytorch/pytorch/pull/48077 Reviewed By: ngimel Differential Revision: D25687042 Pulled By: bugra fbshipit-source-id: 05f20a6f3c5212f73d0b1505b493b720e6cf74e5	2020-12-22 18:07:08 -08:00
mrshenli	e4eaa6de5f	Fix lint (#49629 ) Summary: Fix lint on master Pull Request resolved: https://github.com/pytorch/pytorch/pull/49629 Reviewed By: rohan-varma Differential Revision: D25654199 Pulled By: mrshenli fbshipit-source-id: 2ab5669ad47996c0ca0f9b6611855767d5af0506	2020-12-18 19:26:06 -08:00
Pritam Damania	159de1f1d6	Add benchmark for torch.distributed.pipeline.sync.Pipe (#49577 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49577 Repurposing the benchmarking from https://github.com/facebookresearch/fairscale/blob/master/benchmarks/pipe.py and pulling in a stripped down version of the benchmark into PyTorch. Sample output: ``` Running benchmark with args: Namespace(batch_size=8, checkpoint='never', chunks=4, host='localhost', max_batch=10, num_decoder_layers=10, num_devices=4) Number of parameters for model: 292833040 | batch 1 | wps 3593.07 | loss 25.98 | ppl 192556591553.37 | batch 2 | wps 4405.16 | loss 19.36 | ppl 256201548.33 | batch 3 | wps 4404.98 | loss 23.56 | ppl 17111244076.37 | batch 4 | wps 4413.25 | loss 27.11 | ppl 594561327825.83 | batch 5 | wps 4408.53 | loss 25.92 | ppl 181277705101.33 | batch 6 | wps 4385.64 | loss 24.92 | ppl 66592883598.50 | batch 7 | wps 4434.11 | loss 24.75 | ppl 56113635884.68 | batch 8 | wps 4441.25 | loss 24.88 | ppl 63666024212.82 | batch 9 | wps 4425.49 | loss 25.35 | ppl 101959669008.98 | batch 10 | wps 4421.05 | loss 25.34 | ppl 101597621863.94 Peak memory usage for GPUs: cuda:0: 2.38GiB, cuda:1: 3.04GiB, cuda:2: 3.04GiB, cuda:3: 3.67GiB, ``` ghstack-source-id: 118939686 Test Plan: sentinel Reviewed By: rohan-varma Differential Revision: D25628721 fbshipit-source-id: 41c788eed4f852aef019aec18a84cb25ad254f3a	2020-12-18 18:33:47 -08:00
elfringham	db1b0b06c4	Flake8 fixes (#48453 ) Summary: Quiet errors from flake8. Only a couple of code changes for deprecated Python syntax from before 2.4. The rest is just adding noqa markers. Pull Request resolved: https://github.com/pytorch/pytorch/pull/48453 Reviewed By: mruberry Differential Revision: D25181871 Pulled By: ngimel fbshipit-source-id: f8d7298aae783b1bce2a46827b088fc390970641	2020-11-25 19:09:50 -08:00
Shen Li	76c7652cc5	Add distributed data parallel benchmark tool (#35198 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35198 The need for this tool was motivated by #28883. In the past, we have done ad-hoc benchmarking, but it's time for something more structured. It would be nice to add more model architectures so that we can get a full picture of the performance impact of a code change simply by running this suite a few times. Test Plan: Imported from OSS Differential Revision: D20591296 Pulled By: mrshenli fbshipit-source-id: ee66ce0ebca02086453b02df0a94fde27ab4be49	2020-04-08 15:07:03 -07:00

49 Commits