Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47896
Per title
ghstack-source-id: 116710141
Test Plan: CI
Reviewed By: osalpekar
Differential Revision: D24943323
fbshipit-source-id: 7bf33ce3a021b9750b65e0c08f602c465cd81d28
Summary:
If world_size is less than or equal to the number of GPUs available,
then each rank can be mapped directly to the corresponding GPU.
This fixes the issue referenced in https://github.com/pytorch/pytorch/issues/45435 and https://github.com/pytorch/pytorch/issues/47629
For world_size = 3 and 8 available GPUs, the rank-to-GPU mapping used to be 0, 2, 4. With the barrier introduced by PR https://github.com/pytorch/pytorch/issues/45181, the tensors in the barrier are mapped to cuda:0, cuda:1, cuda:2 while the tensors in the actual test cases are mapped to cuda:0, cuda:2, cuda:4, resulting in different streams and leading to a timeout. This issue is specific to the default process group; it is not observed in a new process group, since the streams are created again after the initial barrier call.
This patch maps each rank to the corresponding GPU when world_size is less than or equal to the number of GPUs, in this case 0, 1, 2.
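A minimal sketch of the mapping described above (the helper name is illustrative, not the actual test code):
```
import torch

def rank_to_gpu(rank, world_size):
    num_gpus = torch.cuda.device_count()
    if world_size <= num_gpus:
        # One GPU per rank: rank i uses cuda:i, matching the devices used by
        # the initial barrier, so both run on the same streams.
        return rank
    # More ranks than GPUs: wrap around.
    return rank % max(num_gpus, 1)
```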
Note: The barrier function in distributed_c10d.py should include a new parameter
to specify the tensor or rank-to-GPU mapping. In that case, this patch will be
redundant but harmless, since the tests can specify tensors on the appropriate
GPUs.
Fixes https://github.com/pytorch/pytorch/issues/47629
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47898
Reviewed By: smessmer
Differential Revision: D24956021
Pulled By: rohan-varma
fbshipit-source-id: a88257f22a7991ba36566329766c106d3360bb4e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46804
As per our design in https://github.com/pytorch/pytorch/issues/44827,
changing the API such that the user places modules on the appropriate devices
instead of having `balance` and `devices` parameters that decide this.
This design allows us to use RemoteModule in the future.
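A hedged sketch of the resulting usage (module sizes and the wrapper step are illustrative, not the exact API): the user places each stage on a device up front, and the pipeline reads placement from the modules themselves.
```
import torch.nn as nn

# The user decides placement explicitly, instead of passing balance/devices.
stage0 = nn.Linear(16, 16).to("cuda:0")
stage1 = nn.Linear(16, 16).to("cuda:1")
model = nn.Sequential(stage0, stage1)
# `model` is then handed to the pipeline wrapper, which infers the devices
# from where the submodules already live.
```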
ghstack-source-id: 116479842
Test Plan: waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D24524219
fbshipit-source-id: 9973172c2bb7636572cdc37ce06bf8368638a463
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47797
NCCL p2p tests previously had hang issues; the reason is that there were some unexpected CUDA context switches. For example, process 1, which is supposed to use only GPU 1, could end up using GPU 0 because the device was not set explicitly.
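A hedged sketch of the fix pattern (the helper name is illustrative): each test process pins itself to its own GPU before touching CUDA.
```
import torch

def setup_device(rank):
    # Explicitly select this process's GPU so no kernel or communication
    # accidentally runs on GPU 0 from another rank.
    torch.cuda.set_device(rank)
    return torch.device("cuda", rank)
```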
ghstack-source-id: 116461969
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24863808
fbshipit-source-id: 92bd3a4874be8334210c7c8ee6363648893c963e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47394
This is a preliminary refactor for the next diff that will add an
additional flag to control whether we throw a StopIteration or not. We
basically move the flags for DDP uneven inputs into a simple class.
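A minimal Python sketch of the idea (names are illustrative, not the actual class introduced by this diff):
```
class _JoinConfig:
    """Groups the DDP uneven-inputs flags that previously lived as separate attributes."""

    def __init__(self, enable=False, throw_on_early_termination=False):
        self.enable = enable
        # Placeholder for the additional flag the follow-up diff adds.
        self.throw_on_early_termination = throw_on_early_termination
```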
ghstack-source-id: 116428177
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D24739509
fbshipit-source-id: 96bf41bd1c02dd27e68f6f37d08e22f33129b319
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47470
Reland of https://github.com/pytorch/pytorch/pull/47206, which was reverted due to failing multigpu tests.
The fix to make multigpu tests work is to compare against `torch.tensor([world_size, 0])`, not hardcode `torch.tensor([2, 0])`, which assumes a world size of 2.
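A hedged sketch of the assertion change (assumes an initialized process group; the assertion call is illustrative):
```
import torch
import torch.distributed as dist

world_size = dist.get_world_size()
expected = torch.tensor([world_size, 0])  # instead of hardcoding torch.tensor([2, 0])
# self.assertEqual(result, expected)
```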
Original commit description:
As discussed offline with pritamdamania87, add testing to ensure per-iteration and rank-dependent control flow works as expected in DDP with find_unused_parameters=True.
ghstack-source-id: 115993934
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D24767893
fbshipit-source-id: 7d7a2449270eb3e72b5061694e897166e16f9bbc
Summary:
Added a convenience function that allows users to load models without DP/DDP from a DP/DDP state dict.
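A hedged sketch of the underlying idea (the helper name here is illustrative, not the function added by this PR): DP/DDP prefixes every key with "module.", so loading into a plain model amounts to stripping that prefix.
```
from collections import OrderedDict

def strip_dp_prefix(state_dict, prefix="module."):
    # Drop the "module." prefix DP/DDP adds to every parameter/buffer key.
    return OrderedDict(
        (k[len(prefix):] if k.startswith(prefix) else k, v)
        for k, v in state_dict.items()
    )

# plain_model.load_state_dict(strip_dp_prefix(ddp_model.state_dict()))
```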
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45643
Reviewed By: rohan-varma
Differential Revision: D24574649
fbshipit-source-id: 17d29ab16ae24a30890168fa84da6c63650e61e9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47206
As discussed offline with pritamdamania87, add testing to ensure per-iteration and rank-dependent control flow works as expected in DDP with `find_unused_parameters=True`.
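A hedged sketch of the kind of model such a test exercises (module and shapes are illustrative):
```
import torch.nn as nn

class RankDependentNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(10, 10)
        self.b = nn.Linear(10, 10)

    def forward(self, x, rank, iteration):
        # Which branch runs depends on the rank and the iteration, so some
        # parameters are unused on some ranks/iterations.
        return self.a(x) if (rank + iteration) % 2 == 0 else self.b(x)
```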
ghstack-source-id: 115854944
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D24659901
fbshipit-source-id: 17fc2b3ebba9cef2dd01d2877bad5702174b9767
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46755
As reported in https://github.com/pytorch/pytorch/issues/41324, there is a bug in DDP when `find_unused_parameters=True` and 2 or more parameters share the same gradient accumulator.
In the reducer, we currently keep a mapping of grad accumulator to index and populate it with map[accumulator] = index, but this overwrites indices when the accumulator is the same. To fix this, switch the mapping values to a vector of indices to hold all such indices that share the same accumulator.
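A hedged Python sketch of the reducer change (the real code is C++):
```
from collections import defaultdict

def build_accumulator_to_indices(accumulators):
    # accumulators[i] is the gradient accumulator of parameter i; two parameters
    # may share one accumulator, so keep *all* indices instead of only the last.
    mapping = defaultdict(list)
    for index, acc in enumerate(accumulators):
        mapping[acc].append(index)
    return mapping
```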
ghstack-source-id: 115453567
Test Plan: Added UT
Reviewed By: pritamdamania87
Differential Revision: D24497388
fbshipit-source-id: d32dfa9c5cd0b7a8df13c7873d5d28917b766640
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46773
Changed the constructor of RemoteModule to accept a `remote_device` arg in the following format:
"<workername>/<device>" (e.g., "trainer0/cpu", "ps0/cuda:0")
This arg merges the original `on` and `device` arg.
Original PR issue: RemoteDevice Format #46554
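A hedged usage sketch (the worker name is a placeholder, and the import path/signature may differ slightly from this diff):
```
import torch.nn as nn
from torch.distributed.nn import RemoteModule

remote_linear = RemoteModule(
    "trainer0/cuda:0",  # "<workername>/<device>" replaces the old `on` + `device` args
    nn.Linear,
    args=(32, 64),
)
```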
ghstack-source-id: 115448051
Test Plan: buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
Reviewed By: pritamdamania87
Differential Revision: D24482562
fbshipit-source-id: 5acfc73772576a4b674df27625bf560b8f8e67c1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46897
These APIs implicitly assumed that the GPU for a rank == the rank index, but
that is not necessarily true. For example, the first GPU could be used for a
different purpose, and rank 0 could use GPU 1, rank 1 GPU 2, etc. Thus, we
mandate that the user specify the device to use via `torch.cuda.set_device()`
before making calls to this API. This expectation should be okay since we
clearly document it, and we expect the user to set this for
DistributedDataParallel as well.
Also adds/tidies up some documentation.
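A hedged sketch of the documented expectation (assumes an initialized NCCL process group; the device choice is illustrative):
```
import torch
import torch.distributed as dist

rank = dist.get_rank()
device_id = (rank + 1) % torch.cuda.device_count()  # e.g. rank 0 owns GPU 1
torch.cuda.set_device(device_id)  # tell the collective which GPU this rank uses
dist.barrier()
```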
ghstack-source-id: 115359633
Test Plan: Modified unittests
Reviewed By: divchenko
Differential Revision: D24556177
fbshipit-source-id: 7e826007241eba0fde3019180066ed56faf3c0ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46568
This PR adds support for an RRef.backward() API. This would be useful
in applications like pipeline parallelism as described here:
https://github.com/pytorch/pytorch/issues/44827
This PR only adds support for local RRefs; remote RRef support will be added in
a follow-up PR.
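A hedged usage sketch for the local-RRef case (assumes `rpc.init_rpc` has already been called):
```
import torch
import torch.distributed.rpc as rpc

x = torch.ones(2, 2, requires_grad=True)
rref = rpc.RRef(x.sum())   # local RRef holding a scalar loss
rref.backward()            # runs autograd locally through the RRef's value
print(x.grad)
```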
ghstack-source-id: 115100729
Test Plan:
1) unit tests.
2) waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D24406311
fbshipit-source-id: fb0b4e185d9721bf57f4dea9847e0aaa66b3e513
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/41807
Test Plan: Make sure ci tests pass, including newly written test
Reviewed By: mrshenli
Differential Revision: D22640839
Pulled By: osandoval-fb
fbshipit-source-id: 3ff98d8e8c6e6d08575e307f05b5e159442d7216
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal.
Makes the affected call sites more readable and possibly faster. Care has to be taken because `list(map(...))` builds the full list of results immediately, while `(x for x in xs)` is a generator expression that is evaluated lazily. This is a benefit in cases where it is not necessary to actually materialize the list of values in memory (e.g. when passing to `tuple`, `extend`, or `join`).
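A small example of the distinction called out above:
```
xs = ["1", "2", "3"]

eager = list(map(int, xs))      # the full list is built right away
lazy = (int(x) for x in xs)     # generator expression: nothing converted yet

total = sum(lazy)               # values are produced only when consumed here
joined = ", ".join(x for x in xs)  # no intermediate list needed
```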
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462
Reviewed By: zou3519
Differential Revision: D24422343
Pulled By: ezyang
fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46372
Currently, in `_run_function`, we catch any exception from the Python
function being run and report it back to the master. However, in some large-scale
training jobs, it would be valuable to also log the error on the trainer
itself for faster debugging.
Test Plan: Added unittest.
Reviewed By: pritamdamania87
Differential Revision: D24324578
fbshipit-source-id: 88460d7599ea69d2c38fd9c10eb6471f7edd4100
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46304
In the case where a single process operates on only one GPU, we can
avoid the scatter and instead replace it with a recursive version of `to`,
which transfers the input tensors to the correct device.
The implementation of `_recursive_to` is modeled after `scatter` in https://github.com/pytorch/pytorch/blob/master/torch/nn/parallel/scatter_gather.py, in order to keep parity with the previous conventions (i.e. custom types not having their tensors moved).
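A hedged sketch of the recursion (not the literal helper; the real version mirrors scatter_gather.py more closely):
```
import torch

def recursive_to(obj, device):
    if isinstance(obj, torch.Tensor):
        return obj.to(device)
    if isinstance(obj, tuple) and hasattr(obj, "_fields"):  # namedtuple
        return type(obj)(*(recursive_to(o, device) for o in obj))
    if isinstance(obj, (list, tuple)):
        return type(obj)(recursive_to(o, device) for o in obj)
    if isinstance(obj, dict):
        return {k: recursive_to(v, device) for k, v in obj.items()}
    return obj  # custom types are left untouched, as with scatter
```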
ghstack-source-id: 114896677
Test Plan: Added unittest, and CI
Reviewed By: pritamdamania87
Differential Revision: D24296377
fbshipit-source-id: 536242da05ecabfcd36dffe14168b1f2cf58ca1d
Summary:
Resolves one item in https://github.com/pytorch/pytorch/issues/46321
This PR sets up DistExamplesTest, which will be used as the class to implement future tests for examples. This class is run as part of CI tests. It also creates a dist_examples folder and includes the [batch server example](https://github.com/pytorch/examples/blob/master/distributed/rpc/batch/parameter_server.py), which is slightly modified so that it can be tested.
Run test:
pytest test/distributed/rpc/test_tensorpipe_agent.py -k test_batch_updating_parameter_server -vs
pytest test/distributed/rpc/test_process_group_agent.py -k test_batch_updating_parameter_server -vs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46510
Reviewed By: mrshenli
Differential Revision: D24379296
Pulled By: H-Huang
fbshipit-source-id: 1c102041e338b022b7a659a51894422addc0e06f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45989
This test was failing internally for the Thrift-based RPC agent, since
it has a different error regex. Use `self.get_timeout_error_regex`, which gets
the timeout error string for each backend, to fix this.
ghstack-source-id: 114463458
Test Plan: CI
Reviewed By: pritamdamania87
Differential Revision: D24170394
fbshipit-source-id: 9b30945e3e30f36472268d042173f8175ad88098
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46221
The RPC framework only allowed sending RPCs based on a provided
WorkerInfo or name. When using RPC with DDP, sometimes it might just be easier
to refer to everything in terms of ranks, since DDP doesn't support names yet.
As a result, it would be helpful for the `to` parameter in the RPC APIs to also
accept a rank.
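A hedged usage sketch (assumes `rpc.init_rpc` has been called and rank 1 exists):
```
import torch
import torch.distributed.rpc as rpc

# Destination given as a rank instead of a WorkerInfo or a worker name.
fut = rpc.rpc_async(1, torch.add, args=(torch.ones(2), torch.ones(2)))
result = fut.wait()
```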
ghstack-source-id: 114207172
Test Plan:
1) waitforbuildbot
2) Unit Tests
Reviewed By: mrshenli
Differential Revision: D24264989
fbshipit-source-id: 5edf5d92e2bd2f213471dfe7c74eebfa9efc9f70
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45994
Send/Recv tests were disabled because of https://github.com/pytorch/pytorch/issues/42517. With that issue fixed, this diff enables those tests.
ghstack-source-id: 113970569
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24172484
fbshipit-source-id: 7492ee2e9bf88840c0d0086003ce8e99995aeb91
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45933
Occasionally users run DDP with models that have unused params; in this
case we would like to surface an error message telling them to run with
find_unused_parameters=True. However, a recent change to the rebuild_buckets logic (https://github.com/pytorch/pytorch/pull/44798) made
it so that we raise a size mismatch error when this happens, but the
information about unused parameters is likely to be more useful and is likely to
be the most common cause of failure. Prefer raising this error over the
subsequent size mismatch errors.
ghstack-source-id: 113914759
Test Plan: Added unittest
Reviewed By: mrshenli
Differential Revision: D24151256
fbshipit-source-id: 5d349a988b4aac7d3e0ef7b3cd84dfdcbe9db675
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45873
This diff adds support for sending/receiving to/from self. It also fixes a bug that occurred when p2p operations were not used by all processes.
ghstack-source-id: 113910526
Test Plan: waitforsandcastle
Reviewed By: jiayisuse
Differential Revision: D24124413
fbshipit-source-id: edccb830757ac64f569e7908fec8cb2b43cd098d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45783
After the previous device maps commits, `pipeWrite` might throw. In
this case, if we increment active calls before `pipeWrite` on the
caller, that active call won't be decremented properly when `pipeWrite`
throws. As a result, `shutdown` can silently time out. I noticed this
because some tests took more than 60s to finish.
This commit extracts the tensor device checking logic out of pipeWrite
and makes sure the error is thrown before the active call count is
incremented.
Differential Revision: D24094803
Test Plan: Imported from OSS
Reviewed By: mruberry
Pulled By: mrshenli
fbshipit-source-id: d30316bb23d2afd3ba4f5540c3bd94a2ac10969b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44921
This diff adds support for Process Group point-to-point operations on the NCCL backend, based on ncclSend/ncclRecv. See https://github.com/pytorch/pytorch/issues/43995 for more context.
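A hedged sketch of the point-to-point usage this enables (assumes an initialized NCCL process group with one GPU per rank):
```
import torch
import torch.distributed as dist

rank = dist.get_rank()
torch.cuda.set_device(rank)
t = torch.full((4,), float(rank), device="cuda")
if rank == 0:
    dist.send(t, dst=1)
elif rank == 1:
    dist.recv(t, src=0)  # t now holds rank 0's values
```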
ghstack-source-id: 113592785
Test Plan: unittest
Reviewed By: jiayisuse
Differential Revision: D23709848
fbshipit-source-id: cdf38050379ecbb10450f3394631317b41163258
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44220
Closes https://github.com/pytorch/pytorch/issues/44009
Currently, if a dataloader returns objects created with a
collections.namedtuple, they will incorrectly be cast to plain tuples. As a result, if we have data of these types, there can be runtime errors during the forward pass if the module is expecting a named tuple.
Fix this in
`scatter_gather.py` to resolve the issue reported in
https://github.com/pytorch/pytorch/issues/44009
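A hedged sketch of the approach (not the literal scatter_gather.py code):
```
def is_namedtuple(obj):
    return isinstance(obj, tuple) and hasattr(obj, "_fields")

def map_structure(obj, fn):
    if is_namedtuple(obj):
        return type(obj)(*map(fn, obj))  # rebuild the same namedtuple type
    if isinstance(obj, tuple):
        return tuple(map(fn, obj))       # plain tuples stay plain tuples
    return fn(obj)
```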
ghstack-source-id: 113423287
Test Plan: CI
Reviewed By: colesbury
Differential Revision: D23536752
fbshipit-source-id: 3838e60162f29ebe424e83e474c4350ae838180b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44826
As described in https://github.com/pytorch/pytorch/issues/43690, there
is a need for DDP to be able to ignore certain parameters in the module (not
install allreduce hooks) for certain use cases. `find_unused_parameters` is
sufficient from a correctness perspective, but we can get better performance
with this upfront list if users know which params are unused, since we won't
have to traverse the autograd graph every iteration.
To enable this, we add a field `parameters_to_ignore` to DDP init and don't
pass those parameters to the reducer if they are in the given list.
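A hedged sketch of building such a list (how it is handed to DDP follows the `parameters_to_ignore` mechanism described above; the exact entry point may differ from this sketch):
```
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 8))
# Fully-qualified names of parameters DDP should skip (no allreduce hooks).
params_to_ignore = [
    name for name, _ in model.named_parameters() if name.startswith("1.")
]
```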
ghstack-source-id: 113210109
Test Plan: Added unittest
Reviewed By: xw285cornell, mrshenli
Differential Revision: D23740639
fbshipit-source-id: a0411712a8b0b809b9c9e6da04bef2b955ba5314
Summary:
In the profiler, CUDA ops did not report self time, so for composite functions there was no way to determine which function was really taking the time. In addition, the reported "total cuda time" was frequently more than the total wallclock time. This PR adds "self CUDA time" to the profiler and computes total CUDA time based on self CUDA time, similar to how it's done for CPU. Also, slight formatting changes to make the table more compact. Before:
```
-------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Name Self CPU total % Self CPU total CPU total % CPU total CPU time avg CUDA total % CUDA total CUDA time avg Number of Calls
-------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
aten::matmul 0.17% 890.805us 99.05% 523.401ms 5.234ms 49.91% 791.184ms 7.912ms 100
aten::mm 98.09% 518.336ms 98.88% 522.511ms 5.225ms 49.89% 790.885ms 7.909ms 100
aten::t 0.29% 1.530ms 0.49% 2.588ms 25.882us 0.07% 1.058ms 10.576us 100
aten::view 0.46% 2.448ms 0.46% 2.448ms 12.238us 0.06% 918.936us 4.595us 200
aten::transpose 0.13% 707.204us 0.20% 1.058ms 10.581us 0.03% 457.802us 4.578us 100
aten::empty 0.14% 716.056us 0.14% 716.056us 7.161us 0.01% 185.694us 1.857us 100
aten::as_strided 0.07% 350.935us 0.07% 350.935us 3.509us 0.01% 156.380us 1.564us 100
aten::stride 0.65% 3.458ms 0.65% 3.458ms 11.527us 0.03% 441.258us 1.471us 300
-------------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Self CPU time total: 528.437ms
CUDA time total: 1.585s
Recorded timeit time: 789.0814 ms
```
Note that the recorded timeit time (with proper cuda syncs) is 2 times smaller than the "CUDA time total" reported by the profiler.
After:
```
-------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Name Self CPU % Self CPU CPU total % CPU total CPU time avg Self CUDA Self CUDA % CUDA total CUDA time avg # of Calls
-------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
aten::matmul 0.15% 802.716us 99.06% 523.548ms 5.235ms 302.451us 0.04% 791.151ms 7.912ms 100
aten::mm 98.20% 519.007ms 98.91% 522.745ms 5.227ms 790.225ms 99.63% 790.848ms 7.908ms 100
aten::t 0.27% 1.406ms 0.49% 2.578ms 25.783us 604.964us 0.08% 1.066ms 10.662us 100
aten::view 0.45% 2.371ms 0.45% 2.371ms 11.856us 926.281us 0.12% 926.281us 4.631us 200
aten::transpose 0.15% 783.462us 0.22% 1.173ms 11.727us 310.016us 0.04% 461.282us 4.613us 100
aten::empty 0.11% 591.603us 0.11% 591.603us 5.916us 176.566us 0.02% 176.566us 1.766us 100
aten::as_strided 0.07% 389.270us 0.07% 389.270us 3.893us 151.266us 0.02% 151.266us 1.513us 100
aten::stride 0.60% 3.147ms 0.60% 3.147ms 10.489us 446.451us 0.06% 446.451us 1.488us 300
-------------------- ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------ ------------
Self CPU time total: 528.498ms
CUDA time total: 793.143ms
Recorded timeit time: 788.9832 ms
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45209
Reviewed By: zou3519
Differential Revision: D23925491
Pulled By: ngimel
fbshipit-source-id: 7f9c49238d116bfd2db9db3e8943355c953a77d0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44419
Closes https://github.com/pytorch/pytorch/issues/39969
This PR adds support for propagation of input shapes over the wire when the profiler is invoked with `record_shapes=True` over RPC. Previously, we did not respect this argument.
This is done by saving the shapes as an ivalue list and recovering it as the expected type (`std::vector<std::vector<int>>` on the client). A test is added to ensure that remote ops have the same `input_shapes` as if the op were run locally.
ghstack-source-id: 112977899
Reviewed By: pritamdamania87
Differential Revision: D23591274
fbshipit-source-id: 7cf3b2e8df26935ead9d70e534fc2c872ccd6958
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45221
This PR introduces a distributed functional optimizer, so that
distributed optimizer can reuse the functional optimizer APIs and
maintain their own states. This could enable the TorchScript-compatible
functional optimizers when using the distributed optimizer, which helps get rid
of the GIL and improves the overall performance of training, especially for
distributed model parallel training.
Test Plan: Imported from OSS
Reviewed By: ailzhang
Differential Revision: D23935256
Pulled By: wanchaol
fbshipit-source-id: 59b6d77ff4693ab24a6e1cbb6740bcf614cc624a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45353
Temporarily removing this feature; it will be added back after the branch cut.
Test Plan: Imported from OSS
Reviewed By: rohan-varma
Differential Revision: D23939865
Pulled By: mrshenli
fbshipit-source-id: 7dceaffea6b9a16512b5ba6036da73e7f8f83a8e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45181
`init_process_group` and `new_group` update a bunch of global
variables after initializing the actual process group. As a result, there is a
race that after initializing the process group on say rank 0, if we immediately
check the default process group on rank 1 (say via RPC), we might actually get
an error since rank 1 hasn't yet updated its _default_pg variable.
To resolve this issue, I've added barrier() at the end of both of these calls.
This ensures that once these calls return we are guaranteed about correct
initialization on all ranks.
Since these calls are usually done mostly during initialization, it should be
fine to add the overhead of a barrier() here.
Closes: https://github.com/pytorch/pytorch/issues/40434, https://github.com/pytorch/pytorch/issues/40378
ghstack-source-id: 112923112
Test Plan:
Reproduced the failures in
https://github.com/pytorch/pytorch/issues/40434 and
https://github.com/pytorch/pytorch/issues/40378 and verified that this PR fixes
the issue.
Reviewed By: mrshenli
Differential Revision: D23858025
fbshipit-source-id: c4d5e46c2157981caf3ba1525dec5310dcbc1830
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44923
This ensures that RPC profiling works in single-threaded server
scenarios and that we won't make the assumption that we'll have multiple
threads when working on this code. For example, this assumption resulted in a
bug in the previous diff (which has since been fixed).
ghstack-source-id: 112868469
Test Plan: CI
Reviewed By: lw
Differential Revision: D23691304
fbshipit-source-id: b17d34ade823794cbe949b70a5ab35723d974203
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44664
Closes https://github.com/pytorch/pytorch/issues/39971. This PR adds support for functions decorated with `rpc.functions.async_execution` to be profiled over RPC as builtins, jit functions, and blocking python UDFs currently can be. The reasoning for this is to provide complete feature support in terms of RPC profiling and the various types of functions users can run.
To enable this, the PR below this one makes it safe to call `disableProfiler()` from another thread. We use that functionality to defer disabling the profiler on the server until the future corresponding to the RPC request completes (rather than only until the blocking `processRPC` call returns, as was done previously). Since by the time that future completes the async function has been kicked off and its own future has completed, we are able to capture any RPCs the function would have called and the actual work done on the other node.
For example, if the following async function is ran on a server over RPC:
```
import time
import torch
import torch.distributed.rpc as rpc

def slow_add(x, y):
    time.sleep(1)
    return torch.add(x, y)

@rpc.functions.async_execution
def slow_async_add(to, x, y):
    return rpc.rpc_async(to, slow_add, args=(x, y))
```
we expect to see the original RPC profiled, the nested RPC profiled, and the actual torch.add() work. All of these events should be recorded with the correct node id. Here is an example profiling output:
```
Name                                                                                                                   Self CPU total %  Self CPU total  CPU total %  CPU total   CPU time avg  Number of Calls  Node ID
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ----------  ------------  ---------------  -------
rpc_async#slow_async_add(worker1 -> worker2)                                                                             0.00%             0.000us         0            1.012s      1.012s        1                1
aten::empty                                                                                                              7.02%             11.519us        7.02%        11.519us    11.519us      1                1
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)                           0.00%             0.000us         0            1.006s      1.006s        1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: aten::empty                                                      7.21%             11.843us        7.21%        11.843us    11.843us      1                2
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::add      71.94%            118.107us       85.77%       140.802us   140.802us     1                3
rpc_async#slow_async_add(worker1 -> worker2)#remote_op: rpc_async#slow_add(worker2 -> worker3)#remote_op: aten::empty    13.82%            22.695us        13.82%       22.695us    22.695us      1                3
---------------------------------------------------------------------------------------------------------------------  ----------------  --------------  -----------  ----------  ------------  ---------------  -------
Self CPU time total: 164.164us
```
This PR also moves a bunch of the profiling logic to `rpc/utils.cpp` to declutter `request_callback` code.
ghstack-source-id: 112868470
Test Plan:
```
rvarm1@devbig978:fbcode (52dd34f6)$ buck test mode/no-gpu mode/dev-nosan //caffe2/test/distributed/rpc:process_group_agent -- test_rpc_profiling_async_function --print-passing-details --stress-runs 1
```
Reviewed By: mrshenli
Differential Revision: D23638387
fbshipit-source-id: eedb6d48173a4ecd41d70a9c64048920bd4807c4