pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Xilun Wu	e799f565eb	[DTensor][TP][Random] Introduce TensorParallelRNGTracker to integrate parallel RNG state with Tensor Parallel (#103910) This PR enables the automatic use of `TensorParallelRNGTracker` in the Tensor Parallel API. Unit tests covering this will be added. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103910 Approved by: https://github.com/wanchaol, https://github.com/fduwjj	2023-06-30 08:06:41 +00:00
Xilun Wu	a66107a30c	[DTensor][Random] Introduce CudaRNGStateTracker to maintain parallel RNG state for DTensor (#103235) # Change This PR adds two classes to DTensor: 1. `CudaRNGStateTracker`: stores Random Number Generator (RNG) states (each a `ByteTensor` object) in a `dict` that maps a tag to its state tensor. It also provides a set of utility methods to access and modify the state tensors. The most important interface is `_distribute_region`, which is used when DTensor executes a random op (an operator that calls the RNG). 2. `OffsetBasedRNGTracker`: this subclass of `CudaRNGStateTracker` defines the default policy for how RNG states are shared and synchronized among all ranks to respect the semantics of DTensor random operators. # Warning - With `Multi-threaded ProcessGroup`, the global variable `_rng_tracker` is shared among threads (ranks), which causes issues. We need to figure out a compatible solution for that. - The RNG state may be asynchronous outside of participating ranks. This is harmless in our current submesh use case, though. Pull Request resolved: https://github.com/pytorch/pytorch/pull/103235 Approved by: https://github.com/wanchaol	2023-06-27 19:00:25 +00:00
shaoyf42	17737f9d0e	[DTensor] Allow DTensor support cuda-like device (#102468) Allow DTensor to support cuda-like devices; fixes https://github.com/pytorch/pytorch/issues/102442 Currently, DTensor supports cuda and cpu. There are other efforts to make DTensor support third-party devices, for example https://github.com/pytorch/pytorch/pull/101914 and https://github.com/pytorch/pytorch/issues/101911. However, those efforts cover only a portion of third-party devices and do not adequately support third-party cuda-like devices. Therefore, we would like to extend DTensor to support cuda-like devices, given how widely cuda is used. 1. Similar to what is done here, we need to initialize the communication backend for the device set by DeviceMesh, so `_default_backend_for_device` is added to `Backend`. Note that when a new backend is registered for a device other than cpu and cuda, a new default backend for that device also needs to be added. 2. `_device_handle` is added to `DeviceMesh` for cuda-like devices, similar to what is set in FSDP. When `_device_handle` is not None, the device behaves similarly to `cuda`. Accordingly, calls such as `torch.cuda.device_count()` need to be changed to `device_mesh._device_handle.device_count()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/102468 Approved by: https://github.com/wanchaol	2023-06-07 23:13:53 +00:00
Wanchao Liang	ff58d19c89	DeviceMesh use dispatchable PG to support custom backend (#102336) This PR switches DeviceMesh to use a dispatchable process group instead. This enables easier backend integration, as users only need to integrate with a c10d process group custom backend, without needing to change DeviceMesh to plug in the backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/102336 Approved by: https://github.com/fduwjj	2023-05-30 19:22:37 +00:00
Xilun Wu	e686a1e1b3	[DTensor][2/N] add Philox offset adjustment logic in operator_dispatch (#98199) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98199 Approved by: https://github.com/wanchaol	2023-04-10 23:57:04 +00:00
Xilun Wu	67963c32bd	[DTensor][1/N] add DTensor RNG state APIs (#98198) Pull Request resolved: https://github.com/pytorch/pytorch/pull/98198 Approved by: https://github.com/wanchaol	2023-04-10 23:57:00 +00:00

6 Commits
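
The tag-to-state bookkeeping described in the #103235 commit message can be illustrated with a small sketch. This is not the PyTorch implementation: it swaps in Python's standard-library `random` generator for the CUDA RNG state, and the names `RNGStateTracker`, `set_seed`, `region`, and the tag `"parallel-rng"` are hypothetical, chosen only for illustration. It shows one plausible shape of the idea: states live in a `dict` keyed by tag, and a context manager temporarily installs a tagged state around a random op, then saves the advanced state and restores the previous one.

```python
import random
from contextlib import contextmanager


class RNGStateTracker:
    """Toy analogue (hypothetical names): keeps RNG states in a dict, tag -> state."""

    def __init__(self):
        self._states = {}

    def set_seed(self, tag, seed):
        # Seed a scratch generator and record its state under `tag`.
        self._states[tag] = random.Random(seed).getstate()

    @contextmanager
    def region(self, tag):
        # Install the tagged state in the global generator, run the body,
        # then store the advanced state and restore the state seen on entry.
        outer = random.getstate()
        random.setstate(self._states[tag])
        try:
            yield
        finally:
            self._states[tag] = random.getstate()
            random.setstate(outer)


tracker = RNGStateTracker()
tracker.set_seed("parallel-rng", 1234)  # hypothetical tag
with tracker.region("parallel-rng"):
    sample = random.random()  # drawn from the tagged state, not the ambient one
```

Two trackers seeded identically under the same tag draw identical values inside their regions, which is the property a per-rank tracker would rely on to keep ranks in agreement, while code outside the region keeps its own independent random stream.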