pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 12:20:52 +01:00

Author	SHA1	Message	Date
Edward Z. Yang	ef3be6726f	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-04 20:05:50 +00:00
PyTorch MergeBot	34aa78274d	Revert "Make distributed modules importable even when backend not built (#159889 )" This reverts commit `4ae57d448c`. Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Failing internal tests, probably typechecks. See D81588399 ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3253651785))	2025-09-04 13:13:52 +00:00
Edward Z. Yang	4ae57d448c	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-03 07:33:55 +00:00
PyTorch MergeBot	420c52ecf3	Revert "Make distributed modules importable even when backend not built (#159889 )" This reverts commit `626cb7df81`. Reverted https://github.com/pytorch/pytorch/pull/159889 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, can't be landed with forward fix due to internal tooling problems ([comment](https://github.com/pytorch/pytorch/pull/159889#issuecomment-3246677982))	2025-09-02 20:24:01 +00:00
Edward Z. Yang	626cb7df81	Make distributed modules importable even when backend not built (#159889 ) This PR is greatly simplified now that it stacked on top of a PR that builds with distributed always. We only need to stub functions that may not be defined due to a backend not being enabled. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/159889 Approved by: https://github.com/wconstab ghstack dependencies: #160449	2025-09-01 23:00:21 +00:00
Paul de Supinski	768a1017c5	Allow parallel start NUMA binding (#161576 ) # Context In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`. However, we would raise an exception if the subprocesses would be spawned in parallel via `ThreadPoolExecutor`, which is an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff). The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start: > Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted. But on further reading, the Linux docs say [`sched_setaffinity` is per thread.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python doc is a misnomer. I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07) The upshot is that we actually can safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread. # This PR Remove restrictions against parallel start for NUMA binding. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161576 Approved by: https://github.com/d4l3k	2025-08-28 01:15:58 +00:00
Paul de Supinski	33346b5814	Support NUMA Binding for Callable Entrypoints, Take 2 (#161183 ) # Context In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize `numactl` CLI. See #160006 for a full description of the challenges. Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set. # This PR On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](https://github.com/pytorch/pytorch/issues/160006#issuecomment-3162018836) Specifically, for each local rank `i`: 1. The parent process sets its own CPU affinity to what local rank `i`'s should be. 2. Then, the parent spawns the subprocess for local rank `i`. 3. Finally, the parent resets its own CPU affinity to what it was originally. There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest one that can work for both `str` and `Callable`, and it's pretty simple. This required a bit of refactoring: 1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`. 2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.). 3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process. 4. Use the context manager for both `str` and `Callable` paths # Test Plan ## Automated `$ pytest test/test_numa_binding.py` ## Manual See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TLDR tried out every combination of `str`, `Callable`, binding disabled, and binding enabled on the same model and saw 2x SM utilization for binding enabled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161183 Approved by: https://github.com/d4l3k	2025-08-23 07:23:22 +00:00
Paul de Supinski	58f9a3dd63	[ez] Only use default numa bindings if nproc == cuda device count (#160848 ) # Context Another fix to enable broad rollout of #149334. The implementation assumes that the trainer process with local rank `n` only uses device `cuda:n`. However, there are sometimes jobs with more than one GPU per process, in which case our assumption could be incorrect and actually lead to worse memory locality. # This PR As titled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160848 Approved by: https://github.com/kiukchung	2025-08-19 02:50:01 +00:00
Paul de Supinski	b26d2a9464	[ez] Make NUMA signpost parameters JSON serializable (#160710 ) # Context Broader context in #160163. In order for the _utils_internal version of signpost_event to do proper logging, its parameters argument needs to be json serializable. # This PR Convert `NumaOptions` to serializable form before inputting to `signpost_event`. # Test Plan ## Automated Added tests `$ pytest test/test_numa_binding.py`. ## Manual See [D80317206](https://www.internalfb.com/diff/D80317206). Pull Request resolved: https://github.com/pytorch/pytorch/pull/160710 Approved by: https://github.com/kiukchung	2025-08-15 16:52:43 +00:00
Paul de Supinski	7e91394955	Support NUMA Binding for Callable Entrypoints (#160163 ) # Context This is an extension of #149334. # This PR Add support for NUMA bindings with Callable entrypoints, such as `do_train` instead of `/usr/local/bin/python`. Most notably, we utilize a hack in order to force `Process.start()` to use custom NUMA bindings for each subprocess. Please search for `HACK:` in the code to see a description of the implementation we chose, and #160006 for discussion of alternatives and why this is necessary. Other changes: * Remove unnecessary `--preferred` option from all binding strategies. By default, Linux already allocates memory to the NUMA node local to the CPU which triggered the allocation. (See [MPOL_LOCAL](https://man7.org/linux/man-pages/man2/set_mempolicy.2.html).) * Refactor so that the main API is `maybe_wrap_command_with_numa_bindings`, which computes bindings for a single rank at a time, rather than `maybe_wrap_with_numa_bindings` which computed bindings for all ranks at once. This allowed for more code sharing between `Callable` and `str` entrypoints. # Test Plan ## Automated `$ pytest test/test_numa_binding.py` ## Manual Using [this benchmark,](https://gist.github.com/pdesupinski/bbe01ade455d86e989794f2c612e2d91), ran ``` $ PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -m torch.distributed.run --standalone --nproc-per-node=8 --numa-binding=node --run-path mlp_train.py 2>&1 \| tee node_callable.txt && PYTHONUNBUFFERED=1 LOGLEVEL=INFO perf stat -e ls_dmnd_fills_from_sys.dram_io_far,ls_dmnd_fills_from_sys.dram_io_near -- python -u -m torch.distributed.run --standalone --nproc-per-node=8 --run-path mlp_train.py 2>&1 \| tee none_callable.txt ``` and observed * 6.6% remote memory accesses with 'node' bindings * 11.6% remote without bindings I also ran similar with `str` entrypoints as before just to be sure it's still working. NOTE: [--run-path triggers the code to be run inside a `Callable`.](`017259f9c6/torch/distributed/run.py (L870)`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/160163 Approved by: https://github.com/d4l3k	2025-08-12 20:08:49 +00:00
raghavhrishi	7ef3c3357d	NUMA binding integration with elastic agent and torchrun (#149334 ) Implements #148689 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149334 Approved by: https://github.com/d4l3k Co-authored-by: Paul de Supinski <pdesupinski@gmail.com>	2025-07-25 21:19:49 +00:00

11 Commits