gather_object is problematic when used with Tensors as they can unpickle on the wrong
device and lead to deadlocks or spurious failures.
This change introduces a RPC workaround for EFA when initing TensorPipe until
they properly address it.
Fixes#73935
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77272
Approved by: https://github.com/pritamdamania87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64128
This PR implements a sharded nn.Linear layer using ShardedTensors with
the following limitations:
1) Works only for ChunkShardingSpec.
2) Implementation is only aimed to demonstrate functionality and is most likely
not performant at all.
The PR also introduces a `shard_parameter` API to easily shard parameters of
`nn.Modules`. This also has the following limitations:
1) Works only for ChunkShardingSpec.
2) Is not performant since it uses broadcast instead of scatter since
ProcessGroupNCCL doesn't yet support scatter.
Overall user API for running a sharded linear would be something like this:
```
# SPMD programming paradigm running same code on all nodes.
fc = nn.Linear(10, 10)
# Setup sharding.
sharding_spec=ChunkShardingSpec(...)
shard_parameter(fc, 'weight', sharding_spec, src_rank=0)
# Run as a normal linear layer.
inp = torch.rand(10, 10)
output = fc(inp)
```
ghstack-source-id: 138500985
Test Plan:
1) unit tests.
2) waitforbuildbot
Reviewed By: wanchaol, bowangbj
Differential Revision: D30621215
fbshipit-source-id: 1aa7478568c18a4572f6c3462fdf24a4cbde01d6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62927
As part of the ShardedTensor work, we realized we do need some sort of
_RemoteDevice structure that deals with our format of "workername/device" so
that users don't have to worry about parsing this string directly.
Right now this structure is just the bare minimum and is mostly a container for
describing a remote device. It is currently only used in ShardedTensor,
ShardingSpec and RemoteModule.
Once we actually have a consolidated remote device proposal, this class can be
extended appropriately if needed.
ghstack-source-id: 135534086
Test Plan:
1) unit tests
2) waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D30170689
fbshipit-source-id: 1ac2e81c7a597dc40bf3fbf2c1168c382c66649f