Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60804
The lowerings are stored as a map c10::Symbol -> std::function, and the
signature of those functions matches the signature of
`computeOperandValue`. Custom lowerings take priority over the standard
ones, i.e. we can redefine how a given op is lowered.
In general this feature is aimed at unblocking users whose models
contain ops that are not yet supported by NNC - it lets them quickly add
a custom lowering for a given op.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D29409580
Pulled By: ZolotukhinM
fbshipit-source-id: e8e8dc9d3cb9155cfbf5c08a4216ba1b5b791a60
Summary:
Previously, in PR https://github.com/pytorch/pytorch/issues/58968, we added RAdam to Optimizers. Here in this PR we are proposing a multi-tensor version of RAdam for PyTorch.
RAdam was proposed in the paper https://arxiv.org/pdf/1908.03265.pdf by Liyuan Liu et al.
It has been one of the most widely used algorithms in the deep learning community.
Differing from the paper, we selected the variance tractability cut-off as 5 instead of 4, as is common practice.
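For intuition, here is a rough sketch of the rectification term and the cut-off of 5 mentioned above (illustrative only; this is not the actual torch.optim implementation):
```python
import math

# Rough sketch of RAdam's rectification logic with the cut-off of 5.
def radam_step_size(step, beta2, lr):
    rho_inf = 2.0 / (1.0 - beta2) - 1.0
    beta2_t = beta2 ** step
    rho_t = rho_inf - 2.0 * step * beta2_t / (1.0 - beta2_t)
    if rho_t > 5.0:  # variance is tractable: apply the rectification term
        rect = math.sqrt(
            (rho_t - 4.0) * (rho_t - 2.0) * rho_inf
            / ((rho_inf - 4.0) * (rho_inf - 2.0) * rho_t)
        )
        return lr * rect  # adaptive (Adam-like) step, scaled by the rectifier
    return lr            # otherwise fall back to an un-adapted step
```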
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59161
Reviewed By: vincentqb
Differential Revision: D29360576
Pulled By: iramazanli
fbshipit-source-id: 7ccdbf12b1ee7f12e66f7d7992123a70cc818b6b
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60669
Test Plan: Added unit test to check for nested outputs.
Reviewed By: ajyu
Differential Revision: D29322025
fbshipit-source-id: a3c8d3c5f0bb7cf7fda4bc5f579adb8fa7bc3724
Summary:
This argument only matters for speed and memory usage, so it is ok to ignore it during the backward pass.
As discussed, we might want to change this to speed up the backward pass in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60673
Reviewed By: soulitzer
Differential Revision: D29370125
Pulled By: albanD
fbshipit-source-id: ad50b3ea530aeb194f5a51845523b517a50f2c71
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60656
This PR uses `torch.testing.get_all_dtypes()` for dtype parametrisation
of tests in `test_sparse_csr.py`. It adds the bool, half, bfloat16, and
complex dtypes that were previously excluded from the tests.
`torch.complex32` is omitted due to lack of coverage and lack of a
specialized `AT_DISPATCH...`.
The process of adding more dtypes to the tests revealed that `.to_dense()`
doesn't work for all dtypes.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D29408058
Pulled By: cpuhrsch
fbshipit-source-id: 319b6f51b9786d6957d508f51657657a6d00267a
Summary: it seems to be accidentally missing
Test Plan: run CI
Reviewed By: suo
Differential Revision: D29335990
fbshipit-source-id: 2790bc10d141f9484a0807ff7800024a02fd9cfa
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58059
Add a CUDA.used vital sign which is true only if CUDA was "used", which technically means the CUDA context was created.
Also adds the following features:
- Force vitals to be written even if vitals are disabled, to enable testing when the env variable is not set at the start of execution
- Add a read_vitals call for Python to read existing vital signs.
Test Plan: buck test mode/dbg caffe2/test:torch -- --regex basic_vitals
Reviewed By: xuzhao9
Differential Revision: D28357615
fbshipit-source-id: 681bf9ef63cb1458df9f1c241d301a3ddf1e5252
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58065
This PR replaces the existing code-generated CPU fallback kernels that XLA uses with a single boxed CPU fallback.
Current state: there are a couple of different design ideas that I want to point out, but the logic for the actual kernel is mostly done and passing tests.
### Design
To preface, I'm not 100% tied to the current design; I'm putting the PR up now for opinions and am totally open to alternatives, some of which I listed below. Actually, after writing this description, I'm leaning toward the following changes:
* Confirm whether or not we can remove all C++ logging info directly in the yaml.
**Current Design**
All of the CPU fallback codegen is deleted. In its place, XLA (and other external backends, later) can choose to opt into a CPU fallback by adding the following code in a C++ file. I have a corresponding [xla-side PR with the xla changes](https://github.com/pytorch/xla/pull/2945/files#diff-1a005c10039f0cb11130a3b740f5de716d2f10acaea121017016025861886798R1).
There's no actual requirement to split up the code into a .h and .cpp file, but that's necessary in the XLA case because they sometimes need to call the fallback directly from their handcrafted kernels.
```
// xla_cpu_fallback.h
#include <ATen/native/CPUFallback.h>
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack);
...
```
```
// xla_cpu_fallback.cpp
#include "torch_xla/csrc/aten_cpu_fallback.h"
...
void xla_cpu_fallback(const c10::OperatorHandle& op, torch::jit::Stack* stack) {
// Do custom logging here
...
// Call the actual boxed CPU fallback.
at::native::cpu_fallback(op, stack);
}
TORCH_LIBRARY_IMPL(_, XLA, m) {
m.fallback(torch::CppFunction::makeFromBoxedFunction<&xla_cpu_fallback>());
}
```
Now that the fallback is exposed in the backend, they can call it directly. Doing so requires converting from an unboxed to a boxed context, for which we provide a utility function. E.g.:
```
#include <ATen/native/CPUFallback.h>
at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
....
if (...call_fallback...) {
return at::native::call_fallback_fn<&xla_cpu_fallback, decltype(at::addmm)>::call("aten::addmm", self, mat1, mat2, beta, alpha);
}
...
}
```
That `decltype(at::addmm)` logic isn't actually used everywhere in the xla-side PR yet, since you hit issues with overloads. I could use it everywhere once #58092 lands.
**Alternatives: The API for calling the CPU fallback directly is ugly, can we make it nicer?**
We could change the api to use `at::redispatch`, which would make it look something like this:
```
at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
....
if (...call_fallback...) {
return at::redispatch::addmm(c10::DispatchKeySet(c10::DispatchKey::CPUFallback), self, mat1, mat2, beta, alpha);
}
...
}
```
Which definitely feels cleaner, but also requires adding a new DispatchKey just for this use case. Conditionally calling the CPU fallback doesn't sound like a hugely important use case, so I don't know if giving up one of our 64 dispatch key slots is worth the API improvement. Totally open to other opinions though!
Another more mild improvement that would avoid having to pass operator string names (including overloads) around would be to codegen (yet another) namespaced API. Something like this:
```
at::Tensor addmm(const at::Tensor& self,const at::Tensor& mat1,const at::Tensor& mat2,const at::Scalar& beta,const at::Scalar& alpha) {
....
if (...call_fallback...) {
return at::fallback::addmm<&xla_cpu_fallback>(self, mat1, mat2, beta, alpha);
}
...
}
```
Writing that out, I actually like it more (I think it'll let us get rid of `decltype(...)`). Maybe that is nice enough to warrant a new codegen API - I haven't tried adding that yet, but if people like it I'm happy to try it out.
**More alternatives**
The current design also involves the backend manually writing and registering the boxed fallback themselves, but an alternative would be for us to do it in codegen too: they would just need to pass in all of the C++ logging that they want done in the fallback, directly through the yaml. The main downsides:
* Backend code that wants to call the fallback needs to abide by whatever convention our codegen uses to name the generated boxed fallback.
* Passing custom C++ logging through yaml is just more fragile: right now xla uses an `iostream` to log each tensor arg in the operator, so we'd have to either force other backends into the same convention or figure something else out later.
To be fair, we actually already do that: XLA has custom per-tensor-arg logging for all of the generated `out` wrappers in the codegen, which we do by passing their C++ logging info through the yaml. This seems unnecessary though, since `out` wrappers just call into a functional kernel, which is hand written with its own custom logging. So my take is: try to remove custom C++ logging from the yaml, and if it turns out to be really necessary, then we may as well take advantage of that to codegen the fallback.
### Performance impact
While ops that fall back to CPU aren't exactly hot path, we probably don't want to use a boxed fallback if it turns out to be an absolute perf killer.
I ran my benchmarks using callgrind, benchmarking both `at::add` and `at::add_out` run on XLA. My callgrind benchmark for `at::add` can be found here (the add_out benchmark looks basically the same): https://www.internalfb.com/phabricator/paste/view/P415418587. I created the benchmark by hacking the existing xla C++ test build scripts and throwing in a reference to callgrind.
I also attached the full callgrind output for each benchmark; the full output is actually pretty noisy and hard to parse, but I focused on everything underneath the `at::add()` call in the output, which was much more stable. My guess is that the noise is due to some heavyweight async startup processing that xla does.
`at::add`:
before: 88,505,130 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421001
after: 102,185,654 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415421273
delta: ~15.5% increase
`at::add_out`:
before: 63,897,395 instructions. Full output: https://www.internalfb.com/intern/everpaste/?handle=GBrrKwtAPlix9wUEAOZtrFXpdO5UbsIXAAAz
after: 73,170,346 instructions. Full output: https://www.internalfb.com/phabricator/paste/view/P415423227
delta: ~14.5% increase
High level takeaway: A framework overhead increase of 10-20% doesn't seem too horrible for the CPU fallback use case.
For structured, functional ops that require a CPU fallback, we're actually in an unfortunate situation: we're doing even more work than necessary. Our codegen automatically creates a `CompositeExplicitAutograd` kernel which calls into the `out` operator. So the extra work that we end up doing is:
* An extra dispatcher hop: (at::add -> CompositeExplicitAutograd -> CPUFallback -> at::native::add) instead of (at::add -> CPUFallback -> at::native::add)
* An unnecessary tensor allocation (the CompositeExplicitAutograd kernel uses at::empty() to create an output tensor, which is immediately overwritten by the CPU fallback)
* An unnecessary meta() call (the CompositeExplicitAutograd kernel calls it to create the output tensor, but we call it again in the CPU kernel).
* unboxing->boxing->unboxing logic (this is the only strictly required piece)
There are definitely ways to avoid the unnecessary work explained above: one would be to give the boxed fallback higher priority than composite keys (there's [an issue for it here](https://github.com/pytorch/pytorch/issues/55104)), and codegen fallthroughs for all composite ops. It'll require more infra to set up, so I see it as more of a perf knob that we can apply if we need it later.
Unfortunately I couldn't dig much deeper into the differences aside from the aggregate change in instructions, since it looks like callgrind fudged some of the instruction attribution (`at::to_cpu` takes up a ton of instructions, but I don't see any attribution for the `at::native::add` kernel anywhere).
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D28833085
Pulled By: bdhirsh
fbshipit-source-id: 537ebd5d7fb5858f1158764ff47132d503c3b92b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60517
This fixes module support in LazyModuleMixin, reported in bug issue #60132.
See the link: https://github.com/pytorch/pytorch/issues/60132
We will have to update lazy_extension, given its dependency on module.py, and update the unit test as well.
Test Plan:
Unit test passes
torchrec test passes
Reviewed By: albanD
Differential Revision: D29274068
fbshipit-source-id: 1c20f7f0556e08dc1941457ed20c290868346980
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60631
Per #48360, speed up `Transformer.generate_square_subsequent_mask`. The new impl is informally ~5x faster, though the absolute difference is probably small.
The PR includes Python and C++ versions, as well as updates to a couple of places where the previous impl had been copied around.
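For reference, a minimal sketch of the kind of mask this function produces (0.0 where attention is allowed, -inf for future positions); this is illustrative and not necessarily the exact new implementation:
```python
import torch

def square_subsequent_mask(sz: int) -> torch.Tensor:
    # -inf above the diagonal (future positions), 0.0 elsewhere.
    return torch.triu(torch.full((sz, sz), float('-inf')), diagonal=1)

print(square_subsequent_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0., 0., -inf, -inf],
#         [0., 0., 0., -inf],
#         [0., 0., 0., 0.]])
```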
Test Plan: Imported from OSS
Reviewed By: jbschlosser, albanD
Differential Revision: D29356673
Pulled By: bhosmer
fbshipit-source-id: 4c062ba0ead61a445aeef451c78777bf0b3a631e
Summary:
`merge` is the directory with the actual changes, not `master`. Verified by downloading artifacts from https://github.com/pytorch/pytorch/pull/60777/checks and searching through the result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60792
Reviewed By: walterddr
Differential Revision: D29405288
Pulled By: driazati
fbshipit-source-id: 419c943727c00429945c1f116645bfa22fb12456
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60593
Per #55270, this PR makes it configurable whether to run LayerNorm before or after other operations in Transformer layers.
However, it leaves for a separate PR the removal of the LayerNorm performed after the final encoder/decoder layer has run, which is redundant when LayerNorm has been run after the other in-layer operations (problem described in #24930, #50086, #51447).
Note: this means that transformers built with `nn.Transformer()` are now configurable, but will still contain a redundant LayerNorm when configured as before. However, callers of the `TransformerEncoder` and `TransformerDecoder` classes have always been able to avoid this redundancy.
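For illustration, a usage sketch assuming the new option is exposed as a `norm_first` constructor flag (check the diff for the exact name and default):
```python
import torch.nn as nn

# Pre-LN layer: LayerNorm runs before attention / feed-forward.
pre_ln_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, norm_first=True)

# Default behaviour (post-LN) is unchanged.
post_ln_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
```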
Reviewer notes:
1. Ran across this during other work, don't know if anybody's working on it already (most recent conversation in issues seems to be from early April). Happy to abandon if so.
2. Was looking for a quick way to add tests but it looks like the existing ones in test_nn just compare against snapshots. I could add something similar, but I'm curious if there's any prepackaged way to add a test that LayerNorm-first (the new option) yields a model that trains properly, etc.
3. New code in the `forward`s was written to minimize diff churn rather than maximize beauty :P happy to pretty it up if desired.
Test Plan: Imported from OSS
Reviewed By: jbschlosser
Differential Revision: D29356590
Pulled By: bhosmer
fbshipit-source-id: 308669326990b8923aab5fcd96e03b582fb21f24
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60782
PR #60473 introduced a new folder nesting level; this change updates
clang_format_utils.py to adjust accordingly the way it sets up the root
path.
Test Plan: Imported from OSS
Reviewed By: zhxchen17
Differential Revision: D29403622
Pulled By: ZolotukhinM
fbshipit-source-id: 6404271615c2d263834cf538ab0153c4d41cc5c3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60692
Update make_cifar_db.cc to work with the DB API changes in D29204425 (00896cb9ed).
Test Plan: buck build caffe2/binaries:make_cifar_db
Differential Revision: D29374754
fbshipit-source-id: 23d2acd24031d11071791e398433b537215ffd38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60711
We already build the docs on each PR; this adds a step to push the relevant folder of the docs. (We build the entire website for pytorch.github.io, which clocks in at around 500 MB, but we really only need the "master" docs, not every version. The master docs by themselves are around 50 MB, which is more reasonable.) It uses the same S3 bucket as the artifacts but places the items at the `pytorch/pytorch/pr-previews/<pr number>` prefix. The bucket has a rule to expire resources in that prefix after 1 month.
On the AWS side the bucket has static hosting enabled with CloudFront directing to the docs preview prefix, so you can see the output at `https://d28slxzaq48q8t.cloudfront.net/<pr number>/`, e.g. https://d28slxzaq48q8t.cloudfront.net/60711/. For advertising we could link this on the HUD PR page as well as in the Dr. CI comment. We could add a CNAME on CloudFront to make this be `pr-preview.pytorch.org/<pr number>` or something but having random PRs be able to host content on the pytorch.org domain seems sketchy.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D29398818
Pulled By: driazati
fbshipit-source-id: 24032854d83815853b3650d8e54f60b684707f76
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59760
See https://github.com/pytorch/pytorch/issues/59049
There are some moving parts to this PR, I'll structure this explanation so the straightforward parts go first, and then the less straightforward parts.
**The actual dispatch to Python.** The core logic of dispatch to Python lives in `concrete_dispatch_fn` in `torch/csrc/autograd/python_variable.cpp`. It takes the input IValue stack, scans all the arguments for Tensor arguments, and defers most of the heavy lifting to `handle_torch_function_no_python_arg_parser` which actually does all of the logic for calling out to torch dispatch (in particular, this function handles multiple dispatch situations for you). Because we have a different function name than regular `__torch_function__` handling, `handle_torch_function_no_python_arg_parser` is generalized to accept a magic method name to look for when testing if Tensors have custom handling or not. Unlike `__torch_function__`, by default there is no `__torch_dispatch__` on Tensor classes.
**Maintaining the Python dispatch key.** In order to get to the dispatch-to-Python logic, we must tag Tensors that have the `__torch_dispatch__` magic method with the newly added Python dispatch key (separated from PythonFuncTorch to allow for a transitional period while they migrate to this mechanism). We expose a new private property `_is_python_dispatch` that assists in debugging whether a Tensor is participating in Python dispatch or not. We apply the Python dispatch key the first time a PyObject for a Tensor is constructed (THPVariable_NewWithVar), testing if `__torch_dispatch__` exists with the newly added `check_has_torch_dispatch`.
**Shallow copy and detach.** For the simple examples tested in this PR, most creations of Tensor route through the dispatcher. The exception to this is `shallow_copy_and_detach`, which bypasses the dispatcher and is used when saving tensors for backwards. When a Tensor is Python dispatch, we override the behavior of `shallow_copy_and_detach` to instead directly call into `__torch_dispatch__` to perform a `detach` operation (in the same way it would be invoked if you called `detach` directly). Because this Python call is triggered directly from c10::TensorImpl, it must be indirected through `PyInterpreter::detach`, which is the general mechanism for dynamic dispatching to the Python interpreter associated with a TensorImpl.
**torchdeploy compatibility.** The dispatch to Python logic cannot be directly registered to the dispatcher as it is compiled in the Python library, which will get loaded multiple times per torchdeploy interpreter. Thus, we must employ a two phase process. First, we register a fallback inside a non-Python library (aten/src/ATen/core/PythonFallbackKernel.cpp). Its job is to determine the appropriate PyInterpreter to handle the Python dispatch by going through all of the arguments and finding the first argument that has a PyObject/PyInterpreter. With this PyInterpreter, it makes another dynamic dispatch via "dispatch" which will go to the correct torchdeploy interpreter to handle dispatching to actual Python.
**Testing.** We provide a simple example of a LoggingTensor for testing, which can be used to generate TorchScript-like traces to observe what operations are being called when a Tensor is invoked. Although a LoggingTensor would be better implemented via an is-a relationship rather than a has-a relationship (as is done in the test), we've done it this way to show that arbitrarily complex compositions of tensors inside a tensor work properly.
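For illustration, a rough sketch of what a `__torch_dispatch__` subclass along these lines can look like (names and wrapping details are simplified assumptions, not the actual test code):
```python
import torch

class LoggingTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, elem):
        # Wrap an existing tensor (a has-a relationship, as in the test).
        r = torch.Tensor._make_subclass(cls, elem, elem.requires_grad)
        r.elem = elem
        return r

    # Per the limitations below, default __torch_function__ handling is disabled.
    __torch_function__ = torch._C._disabled_torch_function_impl

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"dispatch: {func}")  # log the op, TorchScript-trace style
        unwrap = lambda t: t.elem if isinstance(t, LoggingTensor) else t
        out = func(*map(unwrap, args), **{k: unwrap(v) for k, v in kwargs.items()})
        return LoggingTensor(out) if isinstance(out, torch.Tensor) else out

x = LoggingTensor(torch.ones(2))
y = x + x  # prints the dispatched add op before returning a LoggingTensor
```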
**Known limitations.**
* We haven't adjusted any operator code, so some patterns may not work (as they lose the Python subclass in an unrecoverable way)
* `__torch_function__` must be explicitly disabled with `_disabled_torch_function_impl` otherwise things don't work quite correctly (in particular, what is being disabled is default subclass preservation behavior.)
* We don't ever populate kwargs, even when an argument is kwarg-only
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Differential Revision: D29017912
Test Plan: Imported from OSS
Reviewed By: bdhirsh
Pulled By: ezyang
fbshipit-source-id: a67714d9e541d09203a8cfc85345b8967db86238
Summary:
This PR makes `tools/clang_tidy.py` use Python 3.6 APIs for `asyncio` and `shlex`.
I ran into some issues when running this script with the `-j` flag inside of the clang-tidy docker image (which uses Python 3.6). Specifically, the functions `asyncio.run` and `shlex.join` are only available in Python >= 3.8.
This change does not affect CI because we do not run the clang-tidy job in parallel.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60659
Reviewed By: albanD
Differential Revision: D29377851
Pulled By: 1ntEgr8
fbshipit-source-id: 92ab7ee6782b78d40ffccd03f1718ede4204d948
Summary:
Currently foreach `addcmul` and `addcdiv` cast the scalar to float so that the actual math is done in FP32 when the tensor dtype is Float16/BFloat16, while the regular `addcmul` and `addcdiv` do not.
### Reproducible steps to see the behavioral difference
```ipython
In [1]: import torch; torch.__version__
Out[1]: '1.9.0'
In [2]: a, b, c = torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([60000.0], device='cuda', dtype=torch.half), torch.tensor([-1.0], device='cuda', dtype=torch.half)
In [4]: torch.addcmul(a, b, c, value=2)
Out[4]: tensor([-inf], device='cuda:0', dtype=torch.float16)
In [5]: torch._foreach_addcmul([a], [b], [c], value=2)[0]
Out[5]: tensor([-60000.], device='cuda:0', dtype=torch.float16)
```
### How foreach casts?
Foreach addcmul and addcdiv cast scalar to `opmath_t` (almost equivalent to acc_type) here: 42c8439b6e/aten/src/ATen/native/cuda/ForeachPointwiseOp.cu (L30) and cast inputs and results here:
42c8439b6e/aten/src/ATen/native/cuda/ForeachFunctors.cuh (L133-L135)
Related to https://github.com/pytorch/pytorch/issues/58833, #60227, and https://github.com/pytorch/pytorch/issues/60454
cc ptrblck mcarilli ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60715
Reviewed By: albanD
Differential Revision: D29385715
Pulled By: ngimel
fbshipit-source-id: 8bb2db19ab66fc99d686de056a6ee60f9f71d603
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60662
Fixes this flaky test. Basically, sometimes a rank can exit the test
early before rank 0 calls into allreduce. In this case Gloo will throw
a connection reset error on all other ranks.
ghstack-source-id: 132363151
Test Plan: CI
Reviewed By: zhaojuanmao
Differential Revision: D29364806
fbshipit-source-id: ce0c292a2166edad57ea0dbb76df12cfd560a10d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60507
Fix incorrect documentation about the dtype for `torch.randint` described in issue #56347
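For reference, a quick check of the actual default behaviour being documented (assuming the issue concerns the default dtype):
```python
import torch

# torch.randint returns integer tensors (int64 by default), independent of
# the global default floating-point dtype.
t = torch.randint(0, 10, (3,))
print(t.dtype)                    # torch.int64
print(torch.get_default_dtype())  # torch.float32
```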
Test Plan: Review documentation to make sure formatting is right
Reviewed By: bdhirsh
Differential Revision: D29321181
fbshipit-source-id: caae69a9bbb30052da518a3f5d22a7ed3504cdd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59987
Similar to GroupNorm, improve the numerical stability of LayerNorm by using the Welford algorithm and pairwise summation.
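For background, a rough sketch of Welford's online mean/variance update (illustrative only; the actual change is in the C++/CUDA kernels):
```python
def welford_mean_var(xs):
    # Single-pass, numerically stable mean/variance (Welford's algorithm).
    mean, m2, count = 0.0, 0.0, 0
    for x in xs:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the updated mean
    return mean, m2 / count       # biased variance, as LayerNorm uses

print(welford_mean_var([1.0, 2.0, 3.0, 4.0]))  # (2.5, 1.25)
```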
Test Plan: buck test mode/dev-nosan //caffe2/test:nn -- "LayerNorm"
Reviewed By: ngimel
Differential Revision: D29115235
fbshipit-source-id: 5183346c3c535f809ec7d98b8bdf6d8914bfe790
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60370
When creating a single partition, skip the output nodes but process possible nodes after them.
Test Plan: Run all CI tests.
Reviewed By: jfix71
Differential Revision: D29265278
fbshipit-source-id: 2242009973a54498d8027cce5a294558a1206fdf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60147
Remove aten::to from allow_list now that the aten::to schema change has landed (D29121620 (eda2ddb5b0)).
Test Plan: CI
Reviewed By: iseeyuan
Differential Revision: D29187314
fbshipit-source-id: abdb5a560287a861f3858732f7b3da342ee4aa55
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60241
We're going to make a forward-incompatible change to this serialization
format soon, so I'm taking the opportunity to do a little cleanup.
- Use int for version. This was apparently not possible when V2
was introduced, but it works fine now as long as we use int64_t.
(Note that the 64 bits are only used in memory. The serializer will
use 1 byte for small non-negative ints.)
- Remove the "packed params" tensor and replace it with a list of ints.
- Replace the "transpose" field with "flags" to allow more binary flags
to be packed in.
- Unify required and optional tensors. I just made them all optional
and added an explicit assertion for the one we require.
A bit of a hack: I added an always-absent tensor to the front of the
tensor list. Without this, when passing unpacked params from Python to
the ONNX JIT pass, the type would be inferred as `List[Tensor]` if all
tensors were present, making it impossible to cast to
`std::vector<c10::optional<at::Tensor>>` without jumping through hoops.
The plan is to ship this, along with another diff that adds a flag to
indicate numerical requirements, wait a few weeks for an FC grace
period, then flip the serialization version.
Test Plan: CI. BC tests.
Reviewed By: vkuzo, dhruvbird
Differential Revision: D29349782
Pulled By: dreiss
fbshipit-source-id: cfef5d006e940ac1b8e09dc5b4c5ecf906de8716