Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58191
There are two clamp overloads: clamp.Scalar and clamp.Tensor. SR needs to either support both or have checks in place to avoid runtime errors. Supporting both is not too hard, so here we are.
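For context, a minimal sketch of the two overloads from the Python surface (illustrative eager-mode usage, not the SR implementation; it assumes a PyTorch build where the Tensor-bounds overload of clamp is available):
```
import torch

x = torch.randn(4)
# clamp.Scalar: scalar min/max bounds
y_scalar = torch.clamp(x, min=-0.5, max=0.5)
# clamp.Tensor: per-element tensor bounds
y_tensor = torch.clamp(x, min=torch.full_like(x, -0.5), max=torch.full_like(x, 0.5))
```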
Reviewed By: edvgha
Differential Revision: D28371949
fbshipit-source-id: 0ec6b8a0b8c6277e50d8e51e4e7a45aa62211e22
Summary:
Port addmm to a structured kernel.
Follow-ups:
- migrate `mm` and `addbmm` to structured
- move the TORCH_CHECKs currently in `addmm_cpu_impl_` and `addmm_out_cuda_impl` to the meta function
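For reference, a minimal eager-mode sketch of the addmm semantics the structured port has to preserve (plain PyTorch, nothing SR- or codegen-specific):
```
import torch

M = torch.randn(2, 3)
a = torch.randn(2, 4)
b = torch.randn(4, 3)
# addmm computes beta * M + alpha * (a @ b)
out = torch.addmm(M, a, b, beta=0.5, alpha=2.0)
assert torch.allclose(out, 0.5 * M + 2.0 * (a @ b), atol=1e-5)
```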
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57417
Reviewed By: bdhirsh
Differential Revision: D28291001
Pulled By: walterddr
fbshipit-source-id: 4eafaa30a465e225fbb4d2a69a36f1e037df9122
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58067
- Use expect_contiguous in layer_norm to avoid unnecessary refcount bumps when the tensors are contiguous
- Clean up some leftovers from the hacky-wrapper removal: use c10::MaybeOwned<Tensor> for bias tensors
- Skip the dispatcher for at::empty in the layer_norm impl in Static Runtime
Test Plan: CI
Reviewed By: swolchok
Differential Revision: D28214298
fbshipit-source-id: 73150fa62d5c18f41a2264f8e56bbe5e377ad045
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58100
aten::clone has a second arg, memory_format, which was not previously supported.
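A small sketch of the two call shapes from the Python side:
```
import torch

x = torch.randn(1, 3, 8, 8)
y = x.clone()                                   # one-arg form, default memory format
z = x.clone(memory_format=torch.channels_last)  # explicit memory_format arg
assert z.is_contiguous(memory_format=torch.channels_last)
```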
Reviewed By: ajyu
Differential Revision: D28347171
fbshipit-source-id: e083cc24c3228048429bba3497326415bc3d1f5a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58018
- Add checks for the number of input args and return nullptr if it doesn't match. This is intended to make Static Runtime more robust so that an op schema change is less likely to break things. Imagine that a new arg is added to an op, or a new overload is added with this extra arg: SR would simply ignore the added arg. If that arg has a default value, SR would run the model with the default value and give you wrong results, which can be hard to track down.
Reviewed By: ajyu
Differential Revision: D28047955
fbshipit-source-id: 01067059edd5cfea80c4ee121829f7733b11f601
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57578
The original impl in SR assumes that eps is a constant, which is true most of the time. However, it could be a graph input as well. This diff fixes that issue and adds unit tests.
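The constant-versus-graph-input distinction can be seen in a generic TorchScript sketch (hypothetical toy functions, not the actual model or the op in question):
```
import torch

# value baked in: it shows up as a prim::Constant node in the graph
@torch.jit.script
def f_const(x: torch.Tensor):
    return x + 1e-5

# value passed in: it is a graph input, only known at run time
@torch.jit.script
def f_input(x: torch.Tensor, eps: float):
    return x + eps

print(f_const.graph)
print(f_input.graph)
```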
Reviewed By: edvgha
Differential Revision: D28207975
fbshipit-source-id: 9a10dec159f3804e43ef74aaa20c3ec6c79548c9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57553
Relanding #57329 (the entire stack) which was reverted because I forgot
to guard a new test with `ifdef LLVM`.
Test Plan: Imported from OSS
Reviewed By: bertmaher
Differential Revision: D28195048
Pulled By: ZolotukhinM
fbshipit-source-id: 50052a2f20f84940b83d1dd1241c8659ff06e014
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57521
When an op is added to Static Runtime, we manually check the schema (not with the JIT schema check, but with IValue::isTensor()/isInt(), etc.) to make sure it's one we support. If the schema doesn't match, SR throws an exception via TORCH_CHECK, which makes the entire graph invalid for SR.
This diff makes ops with unsupported schemas take the fallback path and go through the dispatcher instead:
```
if (node->kind() != prim::ListConstruct &&
    node->kind() != prim::TupleConstruct &&
    node->kind() != prim::DictConstruct &&
    node->kind() != prim::ListUnpack) {
  const Operator& op = node->getOperator();
  TORCH_CHECK(op.hasOperation());
  op_ = op.getOperation(node);
  VLOG(1) << "Fallback interpreter for node: " << PrintNode(node);
}
```
The 2-arg `torch.norm`, which the SR `torch.norm` impl doesn't support (only the 3-, 4-, and 5-arg forms are supported), can now run in Static Runtime via the fallback mode.
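For illustration, the call shapes from the Python surface (the arg counts above refer to the underlying aten schema; the SR-vs-fallback dispatch happens beneath this):
```
import torch

x = torch.randn(3, 4)
n2 = torch.norm(x, 2)         # the 2-arg form discussed above
n3 = torch.norm(x, 2, dim=1)  # a longer form with an explicit dim
```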
(Note: this ignores all push blocking failures!)
Reviewed By: ajyu
Differential Revision: D27531447
fbshipit-source-id: 0a9c2662ac73ed0393a23cc3a2c7df45fdb00fdd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57282
Added support for fb::expand_dims for SR.
Test Plan:
buck test caffe2/torch/fb/sparsenn:gpu_test -- test_expand_dims
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
Reviewed By: hlu1
Differential Revision: D28043049
fbshipit-source-id: 01f59db7b507f027b220f044d6ff23602adbdb06
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os


def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files


def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])


def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)


if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56444
Added an out variant for layer_norm.
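For context, a sketch of the general out-variant pattern this adds for layer_norm, illustrated with torch.add since the layer_norm out variant itself lives at the C++ level and isn't part of the Python surface:
```
import torch

x = torch.randn(8, 16)
y = torch.randn(8, 16)
out = torch.empty_like(x)
torch.add(x, y, out=out)  # result written into the preallocated tensor, no new allocation
```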
Test Plan:
buck test caffe2/aten:math_kernel_test -- NativeLayerNorm
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D27873846
fbshipit-source-id: 53ee9fec4ff9a4e78198b031e86b5afd013626dd
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56841
- Move arg checks to outside the lambda so we can perform these checks at Static Runtime initialization time
- use `optional` where possible
- support the `to.other` overload, the 5-arg variant of `torch.to` (see the sketch below)
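Roughly, from the Python side, this is the overload that takes another tensor as the conversion target (illustrative only):
```
import torch

x = torch.randn(2, 2)
other = torch.zeros(2, 2, dtype=torch.float64)
y = x.to(other)          # to.other: take dtype/device from another tensor
z = x.to(torch.float64)  # the more common to.dtype overload
```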
Test Plan:
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest
buck test mode/opt-clang //caffe2/caffe2/fb/predictor:ptvsc2_predictor_bench_test -- --run-disabled
```
Reviewed By: edvgha
Differential Revision: D27933176
fbshipit-source-id: 49d6249c8784c44146461e286e7a301596172d7c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56082
The native_functions.yaml changes were done by codemod using the
following script:
```
import ruamel.yaml
from ruamel.yaml.tokens import CommentToken
from ruamel.yaml.error import CommentMark
from tools.codegen.model import *  # noqa: F403

with open("aten/src/ATen/native/native_functions.yaml", "r") as f:
    contents = f.read()

yaml = ruamel.yaml.YAML()
yaml.preserve_quotes = True
yaml.width = 1000
yaml.boolean_representation = ['False', 'True']
r = yaml.load(contents)

convert = '''\
acos
acosh
asin
asinh
atan
atanh
cos
cosh
digamma
erf
erfc
erfinv
exp
expm1
exp2
lgamma
log
log10
log1p
log2
reciprocal
sigmoid
sin
sinc
sinh
special_entr
sqrt
tan
tanh'''.split()

for e in r:
    f = NativeFunction.from_yaml(e, Location("", 0))
    if f.structured or f.structured_delegate is not None:
        continue
    n = f.func.name.name.base
    if n not in convert:
        continue
    # mutate e to make changes
    if f.func.kind() == SchemaKind.out:
        e.insert(1, 'structured', True)
        e.insert(2, 'structured_inherits', 'TensorIteratorBase')
    else:
        # TODO: The .out overload assumption is not sound in general
        e.insert(1, 'structured_delegate', f'{n}.out')
    e['dispatch'].pop('CPU', None)
    e['dispatch'].pop('CUDA', None)
    e['dispatch'].pop('CPU, CUDA', None)
    e['dispatch'].pop('CompositeExplicitAutograd', None)
    *_, last_k = e.keys()
    needs_fixup = False
    if not e['dispatch']:
        if last_k == 'dispatch':
            needs_fixup = True
        del e['dispatch']
    # Manually fix up newlines at the end, because ruamel
    # made some bad life choices about where to associate trailing
    # whitespace for nested dicts; see
    # https://stackoverflow.com/questions/42172399/modifying-yaml-using-ruamel-yaml-adds-extra-new-lines
    if needs_fixup:
        *_, last_k = e.keys()
        # post_key, pre_key, post_value, pre_value
        e.ca.items[last_k] = [None, None, CommentToken('\n\n', CommentMark(0), None), None]

with open("aten/src/ATen/native/native_functions.yaml.new", "w") as f:
    yaml.dump(r, f)
```
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D27777769
Pulled By: ezyang
fbshipit-source-id: 1ecbac7cb3e0093167bb61c7d2b1ecb95b8ae17c
Summary:
This PR adds a `padding_idx` parameter to `nn.EmbeddingBag` and `nn.functional.embedding_bag`. As with `nn.Embedding`'s `padding_idx` argument, if an embedding's index is equal to `padding_idx` it is ignored, so it is not included in the reduction.
This PR does not add support for `padding_idx` for quantized or ONNX `EmbeddingBag` for opset10/11 (opset9 is supported). In these cases, an error is thrown if `padding_idx` is provided.
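A small usage sketch, assuming a build that includes this change:
```
import torch
import torch.nn as nn

# indices equal to padding_idx are skipped and contribute nothing to the per-bag reduction
bag = nn.EmbeddingBag(num_embeddings=10, embedding_dim=3, mode='sum', padding_idx=0)
inp = torch.tensor([[1, 2, 0], [3, 0, 0]])
out = bag(inp)  # shape (2, 3); the trailing zeros are ignored
```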
Fixes https://github.com/pytorch/pytorch/issues/3194
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49237
Reviewed By: walterddr, VitalyFedyunin
Differential Revision: D26948258
Pulled By: jbschlosser
fbshipit-source-id: 3ca672f7e768941f3261ab405fc7597c97ce3dfc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55337
`static_runtime::permute_copy` is in fb-only folder. Because `caffe2/test/test_static_runtime.py` is in OSS, we can't load the fb-only operator library. The workaround is to check at runtime whether the op is registered or not.
Test Plan:
This fixed two of the broken tests:
```
✓ Pass: caffe2/test:static_runtime - test_multihead_attention_layer (test_static_runtime.TestStaticModule) (10.316)
✓ Pass: caffe2/test:static_runtime - test_mlp (test_static_runtime.TestStaticModule) (16.134)
```
Reviewed By: ajyu
Differential Revision: D27577066
fbshipit-source-id: ac87dcde71f0d5140ccde448bb49aaebbbb5908a
Summary:
Converts loops of the form:
```
for(int64_t VAR=0;VAR<LIMIT;VAR++)
```
to the form
```
for(const auto VAR : c10::irange(LIMIT))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55148
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D27447811
fbshipit-source-id: 6311a094ec4a81a0b57383aaee0ba1b1dc2445c4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55098
resize_as_ still goes through the dispatcher because it calls tensor.resize_. We can easily call resize_ directly while bypassing the dispatcher.
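For reference, the relationship between the two calls at the Python level (the diff itself only changes the internal C++ dispatch):
```
import torch

x = torch.empty(0)
y = torch.randn(3, 4)
x.resize_as_(y)  # equivalent to x.resize_(y.size())
assert x.shape == y.shape
```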
Reviewed By: swolchok
Differential Revision: D27457894
fbshipit-source-id: 8a5da185d1a6addafbf4915e29613013451b5e43
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55096
There were issues with D26138322 (5b0a6482c1) that we didn't catch the first time around.
This diff (rebased on top of the to_copy fixes) fixes the converted remote_ro C2/PT output comparison.
Test Plan:
```
MKL_NUM_THREADS=1 OMP_NUM_THREADS=1 numactl -m 0 -C 3 ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench --c2_model=/data/users/ansha/tmp/adfinder/210494966_0.predictor.disagg.remote_request_only --c2_inputs=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_input_data.pb --pred_net=/data/users/ansha/tmp/adfinder/models/c2_remote_ro_net2.pb --c2_sigrid_transforms_opt=1 --c2_apply_nomnigraph_passes=1 --c2_use_memonger=1 --scripted_model=/data/users/ansha/tmp/adfinder/models_dianshi/210494966_0.predictor.disagg.remote_request_only.pt --pt_inputs=/data/users/ansha/tmp/adfinder/models/remote_ro_wrapped_input_data.pt --pt_enable_static_runtime=1 --pt_cleanup_activations=1 --pt_enable_out_variant=1 --compare_results=1 --iters=1 --warmup_iters=1 --num_threads=1 --do_profile=0 --benchmark_c2_predictor=1 --do_benchmark=1
```
Reviewed By: hlu1
Differential Revision: D27477104
fbshipit-source-id: 5a95dfa7eae23566fadc3fec323ad03a34e6734d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54438
The August 1x model has DictConstruct in the graph (P331168321).
These nodes can easily be removed with a JIT pass, but to measure the improvement
and run the replayer with the model in the meantime, enable DictConstruct in Static Runtime.
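For illustration, a hypothetical scripted function whose dict literal lowers to a prim::DictConstruct node:
```
import torch
from typing import Dict

@torch.jit.script
def make_dict(x: torch.Tensor) -> Dict[str, torch.Tensor]:
    # the dict literal below becomes a prim::DictConstruct node in the graph
    return {"out": x + 1}

print(make_dict.graph)
```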
Test Plan:
```
./sigrid/predictor/scripts/pytorch/pyper_inference_e2e_local_replayer_test.sh \
cpu 218841466_0 7449 /data/users/ansha/tmp/adfinder/august_1x/ /data/users/ansha/tmp/adfinder/august_1x/filtered_requests_inline_cvr_100
```
```
TEST trace
Total num requests 100
Num exceptions 0
Latency us avg 180965
Latency us p25 89785
Latency us p50 131240
Latency us p75 146621
Latency us p90 158378
Latency us p95 166628
Latency us p99 1886680
Latency us p100 3803252
Server latency us avg 91554
Server latency us p25 51447
Server latency us p50 86371
Server latency us p75 95229
Server latency us p90 102706
Server latency us p95 116023
Server latency us p99 557017
Server latency us p100 716319
Num rankUnits avg 28
```
Reviewed By: hlu1
Differential Revision: D27236682
fbshipit-source-id: 1da49a836dd7533480e77797338baa9edcb65fb5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54467
`at::native::copy_` requires src/dest to have the same sizes, which isn't true in reshape.
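From the Python surface, where `Tensor.copy_` is the closest analogue to `at::native::copy_` (illustrative sketch):
```
import torch

src = torch.randn(2, 6)
dst = torch.empty(3, 4)
# dst.copy_(src) would fail: the sizes don't match, and a reshape's output
# sizes generally differ from its input's
dst.copy_(src.reshape(3, 4))  # reshape first, then copy
```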
Test Plan: Added new test cases to cover this case.
Reviewed By: ajyu
Differential Revision: D27249617
fbshipit-source-id: 2c95175fa8564b3c648979445ad4314f97818852
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54353
The current implementation of reshape/flatten is problematic because the output is sometimes a tensor view and sometimes not, depending entirely on the graph IR and the input shapes. Replacing them with the copy versions makes the behavior deterministic: the output is always a new tensor.
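A quick eager-mode illustration of the nondeterminism (whether reshape returns a view or a copy depends on the input's layout):
```
import torch

x = torch.randn(4, 4)
v = x.reshape(16)
assert v.data_ptr() == x.data_ptr()  # contiguous input: reshape is a view

t = x.t()                            # transpose makes the tensor non-contiguous
c = t.reshape(16)
assert c.data_ptr() != t.data_ptr()  # non-contiguous input: reshape copies
```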
Reviewed By: ajyu, edvgha
Differential Revision: D26358525
fbshipit-source-id: ee7571317b061221a8d50083676cded388ce6f87
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53669
This PR does two things:
* Ports `pow` to be structured
* Fixes a bug with how pow handles mixed cpu and cuda tensors
**bug fix**
Pow is a binary op, and all binary ops that use TensorIterator are currently written to handle the case where one of the inputs is a CUDA tensor and the other is a zero-dimensional CPU tensor.
`pow` incidentally only handles one of the two cases: it fails when the CUDA tensor is passed as the exponent, e.g. `at::pow(torch.tensor(2.0, device='cpu'), torch.tensor([2, 2], device='cuda'))`. Porting `pow` to structured happened to change the error that was emitted from a `TORCH_CHECK` in TensorIterator to an `INTERNAL_ASSERT` in loop.cuh, so I ended up fixing the error and updating the tests. I added more details in a comment on the PR.
**notes on the structured port**
Pow is a little weird, so I wrote down a couple of issues I noticed during the port:
* Multiple independent overloads. `pow` has two overloads that have their own cpu/cuda kernels, meaning one doesn't call the other. I had to update the names of the kernel overloads to make the compiler happy, since the codegen would otherwise try to generate two classes with the same name. `pow` actually has 3 overloads that all have `out` variants, so I ported all 3 to structured; one of them just happens to redispatch to one of the others in most cases.
* Name propagation. Is name propagation implemented per operator, or is it expected to work for most/all ops by default? Right now it looks like it happens for TensorIterator ops by default. For ops that don't use TensorIterator, we need to explicitly pass the names through to the `set_output()` call in the meta function. This happened to matter for `pow` because it has 3 overloads, but only two of them directly use TensorIterator. I had to pass names directly to `set_output` in the 3rd overload to make tests happy.
* Lack of `const Tensor &` in the C++ API. It's a goal to slowly make all `Tensor &` arguments const as part of the structured port, but in this case I needed to explicitly cast constness away because one structured kernel called back into the C++ API, which still has ordinary `Tensor &` arguments. This probably isn't something we'll fix soon, since we have boxing logic that actually relies on the `Tensor &` / `const Tensor &` distinction in some places.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D27029821
Pulled By: bdhirsh
fbshipit-source-id: c1786e770de6e6c2474b9a48210b88057ab1018e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54229
Because Caffe2's add uses Eigen for add with broadcasting, which is not well supported by OSS PyTorch, it's easier to just keep `c2_add_out` internal for now. Caffe2 does use the MKL add when the input dims of A and B are the same and no broadcasting is needed.
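For context, the two shapes of add being distinguished above, shown in eager PyTorch (the diff itself concerns the Caffe2 Eigen/MKL paths):
```
import torch

a = torch.randn(4, 3)
b = torch.randn(3)
c = torch.add(a, b)                  # broadcasting: b is expanded along dim 0
d = torch.add(a, torch.randn(4, 3))  # same dims for both inputs: no broadcasting
```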
Reviewed By: bertmaher
Differential Revision: D27036279
fbshipit-source-id: 49f0ec5407ea1f641896f054cad2283faed81687
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52692
Porting `at::mul` to structured.
One other issue I hit with the port was the fact that there are a bunch of other places around the code base that used to call out to variants of `at::native::mul`, which no longer exists. *Technically*, `at::cpu::mul` does the equivalent thing now, so I patched most call-sites to use that. There were two other places where I did something slightly different (calling `at::cuda::mul` and `at::mul`, respectively), which I called out in the comments.
Test Plan: Imported from OSS
Reviewed By: ezyang
Differential Revision: D27029822
Pulled By: bdhirsh
fbshipit-source-id: 6cc80de0dfccec304bf8e16a1823e733bed27bf4