Commit Graph

83 Commits

Pritam Damania
8b501dfd98 Fix memory leak in TensorPipeAgent. (#50564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50564

When an RPC was sent, the associated future was stored in two maps:
pendingResponseMessage_ and timeoutMap_. Once the response was received, the
entry was removed only from pendingResponseMessage_ and not from timeoutMap_;
the pollTimeoutRpcs method then eventually removed the entry from timeoutMap_
after the timeout duration had passed.

However, in scenarios with a large timeout and a large number of RPCs in
flight, it is very easy for timeoutMap_ to grow without bound. This was
discovered in https://github.com/pytorch/pytorch/issues/50522.

To fix this issue, I've added code to clean up timeoutMap_ as well once we
receive a response.
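
A minimal sketch of the fix, with simplified types (the `expiryOf_` index is a hypothetical stand-in for how the agent locates the entry; the real TensorPipeAgent differs in detail):

```
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

struct Future {};  // stand-in for the RPC future type
using Clock = std::chrono::steady_clock;

class Agent {
 public:
  void registerRpc(uint64_t id, std::shared_ptr<Future> fut,
                   Clock::time_point expiry) {
    std::lock_guard<std::mutex> lock(mu_);
    pendingResponseMessage_[id] = std::move(fut);
    timeoutMap_[expiry].push_back(id);
    expiryOf_[id] = expiry;
  }

  // The fix: on response, erase the entry from BOTH maps instead of leaving
  // timeoutMap_ to be drained only after the (possibly huge) timeout elapses.
  void onResponse(uint64_t id) {
    std::lock_guard<std::mutex> lock(mu_);
    pendingResponseMessage_.erase(id);
    auto it = expiryOf_.find(id);
    if (it != expiryOf_.end()) {
      auto& ids = timeoutMap_[it->second];
      ids.erase(std::remove(ids.begin(), ids.end(), id), ids.end());
      if (ids.empty()) {
        timeoutMap_.erase(it->second);
      }
      expiryOf_.erase(it);
    }
  }

 private:
  std::mutex mu_;
  std::unordered_map<uint64_t, std::shared_ptr<Future>> pendingResponseMessage_;
  std::map<Clock::time_point, std::vector<uint64_t>> timeoutMap_;
  std::unordered_map<uint64_t, Clock::time_point> expiryOf_;
};
```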
ghstack-source-id: 119925182

Test Plan:
1) Unit test added.
2) Tested with repro in https://github.com/pytorch/pytorch/issues/50522

Closes: https://github.com/pytorch/pytorch/issues/50522

Reviewed By: mrshenli

Differential Revision: D25919650

fbshipit-source-id: a0a42647e706d598fce2ca2c92963e540b9d9dbb
2021-01-18 16:34:28 -08:00
Shen Li
098751016e Completely Remove FutureMessage from RPC cpp tests (#50027)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50027
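
This commit and the two below it migrate RPC code from FutureMessage to ivalue::Future. For context, a minimal hedged sketch of the c10::ivalue::Future API being migrated to (simplified; real RPC usage differs):

```
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>

int main() {
  // A future that will hold an int IValue.
  auto fut = c10::make_intrusive<c10::ivalue::Future>(c10::IntType::get());
  fut->markCompleted(c10::IValue(42));  // producer side
  fut->wait();                          // consumer side: block until ready
  return fut->value().toInt() == 42 ? 0 : 1;
}
```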

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25753815

Pulled By: mrshenli

fbshipit-source-id: 85b9b03fec52b4175288ac3a401285607744b451
2021-01-07 19:50:50 -08:00
Shen Li
008206decc Replace FutureMessage with ivalue::Future in RRefContext (#49960)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49960

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25730530

Pulled By: mrshenli

fbshipit-source-id: 5d54572c653592d79c40aed616266c87307a1ad8
2021-01-07 19:50:19 -08:00
Shen Li
25ef605132 Replace FutureMessage with ivalue::Future in distributed/autograd/utils.* (#49927)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/49927

Test Plan: Imported from OSS

Reviewed By: lw

Differential Revision: D25724241

Pulled By: mrshenli

fbshipit-source-id: d608e448f5224e41fbb0b5be6b9ac51a587f25b4
2021-01-07 19:50:16 -08:00
Wanchao Liang
553ccccc54 [c10d] switch ProcessGroup to be managed by intrusive_ptr (#47343)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47343
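
This commit (and the Store commits below) switches c10d objects to intrusive_ptr ownership. A hedged sketch of that model (`StoreLike` is a hypothetical illustration, not the real Store):

```
#include <c10/util/intrusive_ptr.h>

// A type managed by c10::intrusive_ptr must derive from
// c10::intrusive_ptr_target, which embeds the refcount in the object.
struct StoreLike : c10::intrusive_ptr_target {
  int value{0};
};

int main() {
  auto s = c10::make_intrusive<StoreLike>();
  auto s2 = s;  // copying bumps the embedded refcount
  s2->value = 7;
  return s->value == 7 ? 0 : 1;
}
```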

Test Plan: Imported from OSS

Reviewed By: gmagogsfm

Differential Revision: D24723418

Pulled By: wanchaol

fbshipit-source-id: 0463819b96c53b12bdbb3905431110d7b21beb77
2020-11-12 07:36:23 -08:00
Wanchao Liang
665ac2f7b0 [reland] [c10d] switch Store to be managed by intrusive_ptr (#47808)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47808

reland https://github.com/pytorch/pytorch/pull/47074

Test Plan: wait for ci

Reviewed By: gmagogsfm

Differential Revision: D24905246

fbshipit-source-id: edeb7e6e486570ce889f12512e9dc02061d6cc03
2020-11-11 22:53:20 -08:00
Wanchao Liang
1f946e942d Revert D24667128: [c10d] switch Store to be managed by intrusive_ptr
Test Plan: revert-hammer

Differential Revision: D24667128 (0cfe3451d4)

Original commit changeset: 9b6024c31c85

fbshipit-source-id: d8ddf9eb2fccef5023e05698e0c4662708fe4945
2020-11-11 10:49:58 -08:00
Wanchao Liang
0cfe3451d4 [c10d] switch Store to be managed by intrusive_ptr (#47074)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47074

Test Plan: Imported from OSS

Reviewed By: mrshenli

Differential Revision: D24667128

Pulled By: wanchaol

fbshipit-source-id: 9b6024c31c851b7c3243540f460ae57323da523b
2020-11-10 23:36:44 -08:00
Pritam Damania
bf85642c4c Remove lock from GraphTask::set_exception_without_signal. (#45867)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45867

In most cases the lock ordering was to hold a lock in local autograd first and
then a lock in DistAutogradContext.

In the case of `set_exception_without_signal` the lock order was reversed, and as
a result we saw potential deadlock issues in our TSAN tests. To fix this, I
removed the lock and instead used a std::atomic exchange.

In addition to this, I fixed TestE2E to ensure that we use the appropriate
timeout.

TestE2EProcessGroup was flaky for these two reasons and is now fixed.
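
A hedged sketch of the locking change (hypothetical `GraphTaskLike` type; the real GraphTask differs):

```
#include <atomic>
#include <exception>
#include <utility>

class GraphTaskLike {
 public:
  // Returns true only for the first caller, which gets to record the error.
  bool setExceptionWithoutSignal(std::exception_ptr e) {
    // exchange() atomically sets the flag and returns its previous value,
    // so no mutex is taken here and no lock-ordering cycle can form with
    // the DistAutogradContext lock.
    if (!hasError_.exchange(true)) {
      error_ = std::move(e);
      return true;
    }
    return false;
  }

 private:
  std::atomic<bool> hasError_{false};
  std::exception_ptr error_;
};
```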
ghstack-source-id: 113592709

Test Plan: waitforbuildbot.

Reviewed By: albanD

Differential Revision: D24120962

fbshipit-source-id: 12447b84ceae772b91e9a183c90d1e6340f44e66
2020-10-05 20:02:29 -07:00
Lucas Hosseini
ac8c7c4e9f Make Channel API accept buffer structs rather than raw pointers. (#45014)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45014

Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/219

Pull Request resolved: https://github.com/pytorch/tensorpipe/pull/212

+ Introduce buffer.h defining the buffer struct(s). The `CpuBuffer`
struct is always defined, while the `CudaBuffer` struct is defined
only when `TENSORPIPE_SUPPORTS_CUDA` is true.
+ Update all channels to take a `CpuBuffer` or `CudaBuffer` for
`send`/`recv` rather than a raw pointer and a length.
+ Make the base `Channel`/`Context` classes templated on `TBuffer`,
effectively creating two channel hierarchies (one for CPU channels,
one for CUDA channels).
+ Update the Pipe and the generic channel tests to use the new API. So
far, the generic channel tests are CPU-only, and tests for the CUDA IPC
channel are (temporarily) disabled. A subsequent PR will refactor the
tests so that the generic tests also cover CUDA channels; another PR
will add support for CUDA tensors in the Pipe. (A hedged sketch of the
buffer structs follows below.)
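
A hedged sketch of the buffer structs and the templated channel base described above (simplified; the real definitions in TensorPipe's buffer.h carry more detail, and the real send/recv take callbacks):

```
#include <cstddef>

struct CpuBuffer {
  void* ptr{nullptr};
  size_t length{0};
};

#ifdef TENSORPIPE_SUPPORTS_CUDA
#include <cuda_runtime.h>
struct CudaBuffer {
  void* ptr{nullptr};
  size_t length{0};
  cudaStream_t stream{cudaStreamDefault};
};
#endif

// Templating the base class on the buffer type effectively creates two
// channel hierarchies: one for CPU channels, one for CUDA channels.
template <typename TBuffer>
class Channel {
 public:
  virtual ~Channel() = default;
  virtual void send(TBuffer buffer) = 0;  // completion callbacks omitted
  virtual void recv(TBuffer buffer) = 0;
};
```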

Differential Revision: D23598033

Test Plan: Imported from OSS

Reviewed By: lw

Pulled By: beauby

fbshipit-source-id: 1d6c3f91e288420858835cd5e7962e8da051b44b
2020-09-21 10:18:45 -07:00
Lucas Hosseini
af3fc9725d Extract rpc/tensorpipe_utils.{cpp,h} from rpc/utils.{cpp,h} (#44803)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44803

Test Plan: CI

Reviewed By: lw

Differential Revision: D23732022

fbshipit-source-id: 5b839c7997bbee162a14d03414ee32baabbc8ece
2020-09-18 13:51:43 -07:00
generatedunixname89002005287564@sandcastle1415.cln1.facebook.com
1dd658f28f [Codemod][GleanFbcode] Remove dead includes in caffe2/test (#43953)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43953

Reviewed By: malfet

Differential Revision: D23445556

fbshipit-source-id: 89cd6833aa06f35c5d3c99d698abb08cd61ae4ab
2020-09-01 21:48:28 -07:00
Shen Li
06aaf8c20d Add set_device_map to TensorPipeOptions to support GPU args (#42637)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42637

This commit enables sending non-CPU tensors through RPC using the
TensorPipe backend. Users can configure device mappings by calling
`set_device_map` on `TensorPipeRpcBackendOptions`. Internally,
the `init_rpc` API verifies the correctness of the device mappings: it
shuts down RPC if the check fails, or proceeds and passes the global
mappings to `TensorPipeAgent` if the check succeeds. For serde,
we added a device-indices field to the TensorPipe read and write buffers,
which must be either empty (all tensors on CPU) or match the tensors
of the RPC message in order and number. This commit does not yet
achieve zero-copy: the tensor is always moved to CPU on the sender
and then moved to the specified device on the receiver.
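
A hedged sketch of the receiver-side behavior described above (`applyDeviceMap` is a hypothetical helper, not the actual agent code):

```
#include <torch/torch.h>

#include <unordered_map>
#include <vector>

// Move received CPU tensors to the devices given by the sender's device
// indices, translated through the configured device map.
std::vector<torch::Tensor> applyDeviceMap(
    std::vector<torch::Tensor> cpuTensors,
    const std::vector<int64_t>& senderDevices,
    const std::unordered_map<int64_t, int64_t>& deviceMap) {
  for (size_t i = 0; i < cpuTensors.size(); ++i) {
    if (senderDevices[i] < 0) {
      continue;  // tensor was on CPU at the sender; leave it there
    }
    auto it = deviceMap.find(senderDevices[i]);
    if (it != deviceMap.end()) {
      cpuTensors[i] = cpuTensors[i].to(
          torch::Device(torch::kCUDA, static_cast<int8_t>(it->second)));
    }
  }
  return cpuTensors;
}
```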

Test Plan: Imported from OSS

Reviewed By: izdeby

Differential Revision: D23011572

Pulled By: mrshenli

fbshipit-source-id: 62b617eed91237d4e9926bc8551db78b822a1187
2020-08-14 18:46:55 -07:00
Nikita Shulga
2f9fd8ad29 Build test_e2e_tensorpipe only if Gloo is enabled (#43041)
Summary:
test_e2e_tensorpipe depends on ProcessGroupGloo and therefore cannot be built when Gloo is disabled;
otherwise, it re-introduces https://github.com/pytorch/pytorch/issues/42776.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43041

Reviewed By: lw

Differential Revision: D23122101

Pulled By: malfet

fbshipit-source-id: a8a088b6522a3bc888238ede5c2d589b83c6ea94
2020-08-14 09:24:47 -07:00
Luca Wehrstedt
ed242cbec5 Guard TensorPipe agent by USE_TENSORPIPE (#42682)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42682

ghstack-source-id: 109834351

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22978717

fbshipit-source-id: 18b7cbdb532e78ff9259e82f0f92ad279124419d
2020-08-14 02:57:36 -07:00
Luca Wehrstedt
8493b0d5d6 Enroll TensorPipe agent in C++-only E2E test (#42680)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42680

ghstack-source-id: 109544678

Test Plan: CI

Reviewed By: mrshenli

Differential Revision: D22978714

fbshipit-source-id: 04d6d190c240c6ead9bd9f3b7f3a5f964d7451e8
2020-08-13 07:07:30 -07:00
Nikita Shulga
64a7939ee5 test_cpp_rpc: Build test_e2e_process_group.cpp only if USE_GLOO is true (#42836)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/42776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42836

Reviewed By: seemethere

Differential Revision: D23041274

Pulled By: malfet

fbshipit-source-id: 8605332701271bea6d9b3a52023f548c11d8916f
2020-08-10 16:54:26 -07:00
Luca Wehrstedt
c30bc6d4d7 Update TensorPipe submodule (#42522)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42522

Main changes:
- Consolidated the CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.

There were a few instances in PyTorch's CMake files where we directly added TensorPipe's source directory as an include path, which, however, doesn't contain the auto-generated header we now add. We fix that by linking those targets against the `tensorpipe` CMake target, so that they pick up the include directories defined by TensorPipe, which do contain the auto-generated header.

I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.

Test Plan: CI

Reviewed By: malfet

Differential Revision: D22959472

fbshipit-source-id: 1959a41c4a66ef78bf0f3bd5e3964969a2a1bf67
2020-08-06 02:14:58 -07:00
Edward Yang
352e15f1a2 Revert D22812445: Update TensorPipe submodule
Test Plan: revert-hammer

Differential Revision: D22812445 (2335430086)

Original commit changeset: e6d824bb28f5

fbshipit-source-id: 606632a9aaf2513b5ac949e4d6687aa7563eae5d
2020-07-31 10:16:48 -07:00
Luca Wehrstedt
2335430086 Update TensorPipe submodule (#42225)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42225

Main changes:
- Consolidated the CMake files to have a single entry point, rather than having a specialized one for PyTorch.
- Changed the way the preprocessor flags are provided, and changed their name.

There were a few instances in PyTorch's CMake files where we directly added TensorPipe's source directory as an include path, which, however, doesn't contain the auto-generated header we now add. We fix that by adding the `tensorpipe` CMake target as a dependency, so that the include paths defined by TensorPipe, which contain that auto-generated header, are used.

I'm turning off SHM and CMA for now because they have never been covered by the CI. I'll enable them in a separate PR so that if they turn out to be flaky we can revert that change without reverting this one.

Test Plan: CircleCI is all green.

Reviewed By: beauby

Differential Revision: D22812445

fbshipit-source-id: e6d824bb28f5afe75fd765de0430968174f3531f
2020-07-30 02:32:52 -07:00
Pritam Damania
ff6e560301 Add C++ end to end test for RPC and distributed autograd. (#36893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36893

Adding an end-to-end test that runs a simple training loop in C++
for the distributed RPC framework.

The goal of this change is to enable LeakSanitizer and potentially catch memory
leaks in the Future. Enabling LSAN with Python multiprocessing is tricky, and we
haven't found a solution for it. As a result, adding a C++ test that exercises
most of the critical codepaths is a good stopgap for now.

As an example, this unit test would've caught the memory leak fixed by:
https://github.com/pytorch/pytorch/pull/31030
ghstack-source-id: 107781167

Test Plan:
1) Verify the test catches memory leaks.
2) waitforbuildbot

Reviewed By: mrshenli

Differential Revision: D21112208

fbshipit-source-id: 4eb2a6b409253108f6b6e14352e593d250c7a64d
2020-07-15 12:59:19 -07:00
Luca Wehrstedt
72f2ff5950 [TensorPipe] Improve serialization (#39010)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39010

The initial version of the serialization for the TensorPipe RPC agent (i.e., the conversion from rpc::Message to tensorpipe::Message) worked around a TensorPipe limitation of allowing only one payload per message by pickling each tensor separately and storing the pickles as metadata, which is a less efficient way of sending data, as it goes through more copies. Having now lifted that limitation, we can improve the serialization. We now put the type and the id in their own payloads, do a single pickling pass over all the tensors of the message (which allows us to deduplicate them), and store the pickle as a payload. Since pickling is a somewhat costly operation, reducing the number of times we do it should benefit performance.

For the same reason, another change I've made here is to separate the allocation of the buffers from the deserialization. This will allow us (in the future) to perform the allocation on the I/O event loop but perform the unpickling in the worker thread, keeping the event loop more responsive.
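
A hedged sketch of the payload layout described above (hypothetical types, not the actual tensorpipe_utils code):

```
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Payload {
  const void* data;
  size_t length;
};

struct OutgoingWireMessage {
  int32_t type;                         // payload 0: message type
  int64_t id;                           // payload 1: request id
  std::string pickle;                   // payload 2: one pickle for ALL tensors
  std::vector<std::string> tensorData;  // remaining payloads: raw tensor bytes

  std::vector<Payload> payloads() const {
    std::vector<Payload> out{{&type, sizeof(type)},
                             {&id, sizeof(id)},
                             {pickle.data(), pickle.size()}};
    for (const auto& t : tensorData) {
      out.push_back({t.data(), t.size()});
    }
    return out;
  }
};
```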
ghstack-source-id: 104810740

Test Plan: RPC tests

Differential Revision: D21716067

fbshipit-source-id: c1475cc78afdcf0820a485ffd98c91abb35796c7
2020-05-28 10:48:24 -07:00
Luca Wehrstedt
bc09478a60 [TensorPipe] Use the new multi-payload message API (#37919)
Summary:
In D21209901 TensorPipe added support for a vector of payloads per message, instead of a single one, so that users with multiple payloads can send them separately, as they are, instead of having to copy them into a new block of contiguous memory. The PyTorch agent was still using the old API, which was preventing us from deleting it. This change has no effect on the over-the-wire format and thus on performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/37919

ghstack-source-id: 103572164

Test Plan:
On both workers
```
import os
import torch
import torch.distributed.rpc as rpc
os.environ["MASTER_ADDR"] = "127.0.0.1"
os.environ["MASTER_PORT"] = "8765"
```
On worker 0
```
rpc.init_rpc(name="foo", rank=0, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2, rpc_backend_options=rpc.TensorPipeRpcBackendOptions(worker_name_to_id={"foo": 0, "bar": 0}))
```
On worker 1
```
rpc.init_rpc(name="bar", rank=1, backend=rpc.backend_registry.BackendType.TENSORPIPE, world_size=2, rpc_backend_options=rpc.TensorPipeRpcBackendOptions(worker_name_to_id={"foo": 0, "bar": 0}))
```
On worker 0
```
In [15]: rpc.rpc_sync("bar", torch.add, args=(torch.full((2,2), 1), torch.full((2,2), 2)))
Out[15]:
tensor([[3., 3.],
        [3., 3.]])

In [16]: rpc.rpc_sync("bar", torch.add, args=(1, 2))
Out[16]: 3
```

Differential Revision: D21425536

fbshipit-source-id: a0ec2be825556b39aff018a2834baf815a6d8fa5
2020-05-07 02:52:30 -07:00
Edward Yang
fe88806784 Back out "Revert D21171334: [pytorch][PR] Change StorageImpl to track byte count rather than element count" (#37893)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37893

Original commit changeset: 50746043acf3

Test Plan: sandcastle and ossci

Reviewed By: malfet, seemethere, ngimel

Differential Revision: D21416509

fbshipit-source-id: 735ec4e61f9d36d4537f52dd2dc6267751aeb94b
2020-05-05 22:43:15 -07:00
Edward Yang
a2fc7f787a Revert D21171334: [pytorch][PR] Change StorageImpl to track byte count rather than element count
Test Plan: revert-hammer

Differential Revision: D21171334

Original commit changeset: 37329a379de9

fbshipit-source-id: 50746043acf3c76754688de0fe6f1cc12437ea2f
2020-05-05 16:36:15 -07:00
Kurt Mohler
3706803b60 Change StorageImpl to track byte count rather than element count (#37776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37776

* Remove type-specific size tracking in favor of byte size tracking in Storage and StorageImpl
* Changed numel() and set_numel() to nbytes() and set_nbytes()
* Added an enum argument to the Storage/StorageImpl constructor to indicate the new meaning of the size parameter (sketched below)
* Update all callers of the changed API
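
A hedged sketch of the resulting API shape (simplified `StorageImplLike` stand-in; the real c10::StorageImpl takes allocator and resizability parameters as well):

```
#include <cstddef>

struct StorageImplLike {
  // Tag type making it explicit at call sites that the size is in bytes.
  struct use_byte_size_t {};

  StorageImplLike(use_byte_size_t, size_t size_bytes) : nbytes_(size_bytes) {}

  size_t nbytes() const { return nbytes_; }
  void set_nbytes(size_t n) { nbytes_ = n; }

 private:
  size_t nbytes_;
};

// Usage: the tag forces callers to acknowledge the byte-count semantics.
// StorageImplLike storage(StorageImplLike::use_byte_size_t{}, 256);
```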

Part of issue https://github.com/pytorch/pytorch/issues/33950
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37028

Differential Revision: D21171334

Pulled By: ezyang

fbshipit-source-id: 37329a379de9a3a83cc5e9007e455a3e1c2d10b8
2020-05-05 14:20:51 -07:00
Hongyi Jia
3411ec6e32 [TensorPipe/RPC] Serialize and deserialize message (#36197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36197

Create APIs to convert between rpc::Message and tensorpipe::Message:
1. tensorpipeSerialize() - converts an rpc::Message to a tensorpipe::Message without copying tensor memory.
2. tensorpipeAllocateMessage() - allocates an rpc::Message based on the received tensorpipe descriptor, to prepare for copy-free receiving.

Test Plan: buck test caffe2/test/cpp/rpc:test_tensorpipe_serialization

Reviewed By: lw

Differential Revision: D20084125

fbshipit-source-id: ffbc310f93443e50261aed752be0fe176610dd2a
2020-05-05 05:45:57 -07:00
Jeremy Lilley
443fe7ca0e [rpc] Avoid wireDeserializer overreading buffers by 1 byte (#36976)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36976

The bounds check and the read were swapped in two places; I noticed
ASAN complaining about an erroneous buffer in an unrelated change.

Adding a couple of simple test cases.
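
A hedged sketch of the bug class being fixed (illustrative helper, not the actual wire-deserializer code):

```
#include <cstddef>
#include <cstdint>
#include <stdexcept>

uint8_t readByte(const uint8_t* buf, size_t len, size_t pos) {
  // Buggy order: uint8_t v = buf[pos]; if (pos >= len) throw ...;
  // (the read overreads by one byte before the check runs).
  // Correct order: check first, then read.
  if (pos >= len) {
    throw std::out_of_range("read past end of buffer");
  }
  return buf[pos];
}
```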
ghstack-source-id: 102606986

Test Plan: buck test mode/dev caffe2/test/cpp/rpc:

Differential Revision: D21148936

fbshipit-source-id: 7ec5007535f7310437ac1b9a72852a223b9dd29a
2020-04-21 17:01:45 -07:00
Nikita Shulga
b9adbb5002 Fix/relax CMake linter rules (#35574)
Summary:
Ignore mixed upper-case/lower-case style for now.
Fix "space between function and its arguments" violations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35574

Test Plan: CI

Differential Revision: D20712969

Pulled By: malfet

fbshipit-source-id: 0012d430aed916b4518599a0b535e82d15721f78
2020-03-27 16:52:33 -07:00
Jeremy Lilley
fff6fe83a7 [pytorch-rpc] WireSerializer should check has_storage() (#34626)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/34626

We need to check has_storage() before looking at the storage in
cloneSparseTensors(), to avoid gratuitously throwing.

Ideally, we'd add a test for this (I wrote one up but had to disable it),
but it won't work until the JIT Pickler supports sparse tensors.
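
A hedged sketch of the guard described above (`shouldCloneDense` is a hypothetical helper, not the actual cloneSparseTensors code):

```
#include <torch/torch.h>

bool shouldCloneDense(const torch::Tensor& t) {
  if (!t.has_storage()) {
    return false;  // e.g. sparse tensors: no dense storage to inspect
  }
  // Only after the check above is it safe to look at the storage.
  return t.nbytes() < t.storage().nbytes();
}
```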
ghstack-source-id: 100018077

Test Plan: buck test mode/dev-nosan caffe2/torch/fb/distributed/thriftRpcAgent/...

Differential Revision: D20399971

fbshipit-source-id: 5debfa8140eb1f949d37336330223962cc320abc
2020-03-12 11:35:21 -07:00
generatedunixname89002005287564
9482683065 Remove dead includes in caffe2/test
Reviewed By: ezyang

Differential Revision: D19273220

fbshipit-source-id: 3dfc3388914e60611c84472e3fc529f5b5e40534
2020-01-21 11:30:34 -08:00
Jeremy Lilley
dff7b945bf Avoid sending large unneeded data over wire in process_group_agent. (#31357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/31357

If a user selects a subset of a Tensor and sends it in an RPC, we were sending
the whole original Tensor Storage over the network.

While this sounds reasonable, in practice we observed view-like Tensors being sent
over RPC where only 1% of the data in the provided Tensor's Storage was
actually used.

The simple solution here is to force a clone in the serializer code when we see that
less than (an arbitrary) half of the bits are used and the Storage is larger than a nominal few KB.
Related tests are added to ensure this doesn't break.

An alternate approach would be to modify the Pickler. That said, since Pickler is shared by more
components, the logic might be harder to tailor appropriately at that layer (particularly
given that the Pickler has explicit logic to share a single Storage* among several Tensors
that commonly point to the same Storage*).

It's possible that we might want to further refine the basic thresholds in this change.
In practice, we've seen a mostly bimodal distribution thus far for the percent of Tensor
Storage referred by a Tensor in observed rpcs (i.e. either 90%+ or sub-10% of the Storage
referenced), hence the existing 50% threshold here is probably not an unreasonable
starting point.
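
A hedged sketch of the heuristic described above (hypothetical helper and thresholds mirroring the text):

```
#include <torch/torch.h>

torch::Tensor maybeCloneForWire(const torch::Tensor& t) {
  constexpr size_t kMinStorageBytes = 4 * 1024;  // "a nominal few KB"
  if (!t.has_storage()) {
    return t;
  }
  const size_t used = t.nbytes();             // bytes the view refers to
  const size_t total = t.storage().nbytes();  // bytes the Storage holds
  if (total > kMinStorageBytes && used * 2 < total) {
    // clone() materializes only the viewed region, so the unused bulk of
    // the original Storage is never serialized.
    return t.clone();
  }
  return t;
}
```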
ghstack-source-id: 95925474

Test Plan: buck test mode/dev caffe2/test/cpp/rpc/...

Differential Revision: D19137056

fbshipit-source-id: e2b3a4dd0cc6e1de820fd0740aa1d59883dbf8d4
2019-12-18 19:24:24 -08:00
Jeremy Lilley
f4e7e9039d Improve process_group_agent() serialization speed (#29785)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/29785

TLDR: This change improves process_group's serialization speed:
  Serialize_Tensor64:     12.38us ->   1.99us  (~-84%)
  Deserialize_Tensor64:   33.89us ->   5.62us  (~-84%)
  Serialize_Tensor1M:    525.74us -> 285.43us  (~-45%)
  Deserialize_Tensor1M:  892.61us -> 273.68us  (~-70%)

After speaking with the JIT team, we had consensus that torch::save()/load()
are somewhat high-overhead for RPC serialization, being mostly intended for
persistent disk data.

(In particular, for large tensors, 35% of the time is spent in CRC checking, even
with the fb-side changes that substitute 40x faster SSE-accelerated CRC checking;
also, for small tensors, the zip-container overhead is considerable, as is the
overhead of lexing/parsing an embedded text Python program for each RPC.)

The JIT team encouraged us to use jit::pickler, with the WriteableTensorData
way of outputting result tensors (not the default side-tensor table, nor
pickling the actual tensors). This ends up pickling only some tensor
metadata and giving us tensor blobs that we can mindlessly
blit over the wire (they are copied to CPU memory if needed).
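
A hedged sketch of this pickling scheme (the `WireBlob` container is hypothetical; `torch::jit::pickle` with a tensor table is the API the text refers to):

```
#include <torch/csrc/jit/serialization/pickle.h>

#include <vector>

struct WireBlob {
  std::vector<char> metaPickle;     // pickled metadata, no tensor bytes
  std::vector<at::Tensor> tensors;  // contiguous blobs to blit separately
};

WireBlob serializeForWire(const c10::IValue& value) {
  WireBlob out;
  // Passing a tensor table makes the pickler record tensor references
  // instead of embedding tensor bytes in the pickle itself.
  out.metaPickle = torch::jit::pickle(value, &out.tensors);
  return out;
}
```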

There is as yet no standardized container format for the pickled data
(there is jit::pickle_save() checked in, but it's experimental and
no load function is provided yet), but they encouraged us to just use
something sensible for this and possibly revisit later. For now, I made
the directory headers slightly HTTP-inspired.

Note that serialization is just one component of the pipeline, but that
said, we also see reasonable reductions in end-to-end echo times (noisier):
   ProcessGroupAgent_Echo(Tensor_Small)   855.25us -> 492.65us  (~-42%)
   ProcessGroupAgent_Echo(Tensor_1M)       10.82ms -> 6.94ms    (~-35%)
   ProcessGroupAgent_Echo(Small_NoTensor) 688.82us -> 301.72us  (~-56%)
   ProcessGroupAgent_Echo(1MB_NoTensor)     4.65ms -> 3.71ms    (~-20%)

I moved the "wire serialization" logic to a separate file to assist with
unit testing.
ghstack-source-id: 94694682

Test Plan:
buck test mode/dev-nosan caffe2/test/cpp/api:serialize
buck test mode/dev-nosan caffe2/test/...

Differential Revision: D18493938

fbshipit-source-id: 07ddfe87dbe56472bc944f7d070627052c94a8f4
2019-11-28 09:57:52 -08:00