Commit Graph

99 Commits

Natalia Gimelshein
ec1af11c2e Revert D30883290: [Static Runtime] Move MemoryPlanner out into memory_planner.cpp
Test Plan: revert-hammer

Differential Revision: D30883290 (0e11454d19)

Original commit changeset: a37570f8d943

fbshipit-source-id: 65c57a2b0d2e3c7006765195dd519e8cf2472f72
2021-09-15 15:40:34 -07:00
Don Jang
0e11454d19 [Static Runtime] Move MemoryPlanner out into memory_planner.cpp (#65011)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65011

This change moves `MemoryPlanner` out of impl.cpp into memory_planner.cpp.

`MemoryPlanner` performs an independent sub-task: statically analyzing the graph, creating a memory plan, and allocating/deallocating managed Tensors.

This change will reduce merge conflicts as I work on MemoryPlanner more actively for output Tensor support.

Test Plan: N/A

Reviewed By: mikeiovine

Differential Revision: D30883290

fbshipit-source-id: a37570f8d9430224a6987d2190bcf81cf875043d
2021-09-15 12:57:39 -07:00
Don Jang
3fb33b38b9 [Static Runtime] Check if outputs of a node do not overlap with each other (#63013)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63013

This change enhances the current memory overlap check to include outputs: the enhancement enforces a constraint that the outputs of a node must NOT overlap with each other, since the node updates all of them at the same time.

This check will detect a problem like T97393697 immediately in debug mode.
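
As a rough illustration, the check amounts to a pairwise non-overlap assertion over the node's outputs. The sketch below is not the actual SR code; it uses a conservative byte-range test (real overlap detection also has to reason about strides and storage offsets):

```cpp
#include <ATen/ATen.h>
#include <c10/util/Logging.h>
#include <vector>

// Conservative sketch: do the dense byte ranges behind two tensors intersect?
static bool byte_ranges_overlap(const at::Tensor& a, const at::Tensor& b) {
  auto* a_lo = static_cast<const char*>(a.data_ptr());
  auto* a_hi = a_lo + a.numel() * a.element_size();
  auto* b_lo = static_cast<const char*>(b.data_ptr());
  auto* b_hi = b_lo + b.numel() * b.element_size();
  return a_lo < b_hi && b_lo < a_hi;
}

// Debug-mode check: no two outputs of the same node may share memory.
void verify_outputs_dont_overlap(const std::vector<at::Tensor>& outputs) {
  for (size_t i = 0; i < outputs.size(); ++i) {
    for (size_t j = i + 1; j < outputs.size(); ++j) {
      DCHECK(!byte_ranges_overlap(outputs[i], outputs[j]));
    }
  }
}
```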

Test Plan:
- Added a unittest `ProcessedNode.VerifyMemoryOverlapWithOverlappingOutputs`

- Ran `inline_cvr` on ./buck-out/opt/gen/caffe2/caffe2/fb/predictor/ptvsc2_predictor_bench with this diff and confirmed that the checking condition holds true during the run.

Reviewed By: hlu1

Differential Revision: D30211705

fbshipit-source-id: 994d8dace2422e2498e504eb61452a55739238c0
2021-09-15 08:38:05 -07:00
Mike Iovine
369db8924f [Static Runtime] Add first iter metric (#64457)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64457

The first iteration is special since it initializes the memory planner. This change logs and reports first iteration time during benchmarking. It also generates a FAI-PEP output when `generate_ai_pep_output` is set.
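
For context, a minimal sketch of measuring the first iteration separately (names here are illustrative; `runtime(args)` stands in for one SR inference):

```cpp
#include <chrono>

// Time a single callable invocation in milliseconds.
template <typename F>
double time_ms(F&& fn) {
  const auto start = std::chrono::steady_clock::now();
  fn();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}

// Usage sketch: the first call initializes the memory planner, so it is
// timed on its own and reported before the main benchmark loop runs.
//   const double first_iter_ms = time_ms([&] { runtime(args); });
//   report("static_runtime_first_iter", first_iter_ms);
```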

Test Plan:
Run any benchmark, and observe:
```
I0902 15:19:32.528977 2492358 impl.cpp:948] PyTorchObserver {"value":6.415958881378174,"unit":"ms","metric":"latency","type":"static_runtime_first_iter"}
...
First iter time: 6.41596 ms
```

Note that this metric is likely to have significantly more noise than the others since we don't have as many data points.

Unit tests: `buck test //caffe2/test:static_runtime`

Reviewed By: d1jang

Differential Revision: D30740619

fbshipit-source-id: 4dcfccd5629f4fa34254fd355073ef19e151245a
2021-09-07 15:00:30 -07:00
Mike Iovine
4aad366111 [Static Runtime] Make per-op latency readable by FAI-PEP (#64315)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64315

Add a new flag `generate_ai_pep_output` to `StaticRuntime::benchmark`. If set, produces per-op-kind average total latency in milliseconds in a JSON format recognized by [Facebook AI performance evaluation platform (FAI-PEP)](https://github.com/facebook/FAI-PEP).

This is useful for observing the impact of changes that make a big difference for a specific op, but do not affect the overall SR latency by more than a few percent.
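
A sketch of what emitting that format might look like (hypothetical aggregation code; only the line-oriented `PyTorchObserver {...}` shape mirrors the log shown in other test plans here):

```cpp
#include <iostream>
#include <map>
#include <string>

// Emit one JSON metric record per op kind, e.g. average latency for aten::add.
void report_per_op_ms(const std::map<std::string, double>& ms_per_op_kind) {
  for (const auto& kv : ms_per_op_kind) {
    std::cout << "PyTorchObserver {\"value\":" << kv.second
              << ",\"unit\":\"ms\",\"metric\":\"latency\",\"type\":\""
              << kv.first << "\"}\n";
  }
}
```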

Reviewed By: hlu1

Differential Revision: D30679352

fbshipit-source-id: c847fa6ea20774aaf1e7949b11db4421d1f70b7e
2021-09-01 14:34:22 -07:00
Zhengxu Chen
ac99d63f83 [jit] Make operation call accept Stack& instead Stack* (#63414)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63414

Fixes a misuse of a raw pointer here: the stack is never nullable, so the operation should take a reference.
ghstack-source-id: 136938318
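
The shape of the change, as a sketch (`Stack` really is a vector of IValues in the JIT):

```cpp
#include <ATen/core/ivalue.h>
#include <functional>
#include <vector>

using Stack = std::vector<c10::IValue>;

// Old style: a raw pointer suggests the stack may be null, which it never is.
// using Operation = std::function<void(Stack*)>;

// New style: a reference makes the non-null invariant explicit at the type level.
using Operation = std::function<void(Stack&)>;
```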

Test Plan:
compiles.

Imported from OSS

Reviewed By: ejguan

Differential Revision: D30375410

fbshipit-source-id: 9d65b620bb76d90d886c800f54308520095d58ee
2021-08-30 11:49:20 -07:00
Mike Iovine
07c5cb8c48 [Static Runtime] Optimize memory planner initialization (#64101)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64101

Checking `getOutOfPlaceOperation(n)` is a very expensive operation, especially in multithreaded environments, due to a lock acquisition when the NNC cache is queried. This slows down the memory planner initialization time, and by extension, the latency for the first static runtime inference.

There are two optimizations in this diff:
* Cache the result of `p_node->has_out_variant()` to avoid the call to `getOutOfPlaceOperation` (see the sketch after this list). This speeds up calls to `canReuseInputOutputs`, which in turn speeds up `isOptimizableContainerType`.
* Precompute all `isOptimizableContainerType` results during static runtime initialization to avoid a pass over each node's inputs.
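
A minimal sketch of the caching idea (illustrative names, not the actual `ProcessedNode`):

```cpp
#include <functional>

class ProcessedNodeSketch {
 public:
  explicit ProcessedNodeSketch(std::function<bool()> expensive_check)
      // Evaluate the getOutOfPlaceOperation-style check once, at init time,
      // instead of on every memory-planner query (no NNC cache lock on the
      // hot path).
      : has_out_variant_(expensive_check()) {}

  bool has_out_variant() const {
    return has_out_variant_;
  }

 private:
  const bool has_out_variant_;
};
```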

Test Plan: All unit tests pass: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: movefast1990

Differential Revision: D30595579

fbshipit-source-id: 70aaa7af9589c739c672788bf662f711731864f2
2021-08-27 17:40:43 -07:00
Don Jang
c90b3cb1da [Static Runtime] Manage temporary Tensors for aten::layer_norm (#64078)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64078

This change converts `aten::layer_norm -> output Tensor` to `static_runtime::layer_norm -> (output Tensor, tmp1 Tensor, tmp2 Tensor)` so that the static runtime can manage the `tmp1` and `tmp2` Tensors.

Currently the out-variant of `aten::layer_norm` creates two temporary Tensors inside it:
```
    at::Tensor mean = create_empty_from({M}, *X);
    at::Tensor rstd = create_empty_from({M}, *X);
```
which the static runtime misses an opportunity to manage.

This change puts them into (unused) output Tensors of a new placeholder op `static_runtime::layer_norm` so that the static runtime can manage them, since the static runtime currently chooses to manage only output tensors.
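
As a sketch of the shape of the change (hypothetical helper, not the real kernel): mean/rstd are written into runtime-provided output tensors instead of being allocated as loose temporaries, so the memory planner can place them in its managed arena.

```cpp
#include <ATen/ATen.h>

void layer_norm_with_managed_temps(
    const at::Tensor& X,  // 2-D [M, N] input, as in the out variant
    at::Tensor& out,
    at::Tensor& mean,     // formerly a temporary created inside the op
    at::Tensor& rstd) {   // formerly a temporary created inside the op
  const int64_t M = X.size(0);
  mean.resize_({M});
  rstd.resize_({M});
  out.resize_(X.sizes());
  // ... the layer_norm kernel fills out/mean/rstd in place ...
}
```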

Test Plan:
- Enhanced `StaticRuntime.LayerNorm` to ensure that `static_runtime::layer_norm` gets activated.

- Confirmed that the new op gets activated during testing:

```
V0825 12:51:50.017890 2265227 impl.cpp:1396] Switch to out variant for node: %8 : Tensor, %9 : Tensor, %10 : Tensor = static_runtime::layer_norm(%input.1, %normalized_shape.1, %4, %4, %5, %3)

```

Reviewed By: hlu1

Differential Revision: D30486475

fbshipit-source-id: 5121c44ab58c2d8a954aa0bbd9dfeb7468347a2d
2021-08-27 02:44:43 -07:00
Hao Lu
3c3bba4169 [Static Runtime] Use F14FastMap/F14FastSet (#63999)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63999

Use folly::F14FastMap/F14FastSet instead of std::unordered_map/unordered_set in the Static Runtime code base. folly::F14FastMap/F14FastSet implement the same APIs as std::unordered_map/unordered_set but are faster. For details, see https://github.com/facebook/folly/blob/master/folly/container/F14.md
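
The swap is mechanical because the F14 containers mirror the std APIs; a usage sketch:

```cpp
#include <folly/container/F14Map.h>
#include <folly/container/F14Set.h>
#include <string>

folly::F14FastMap<std::string, int> op_counts;  // was std::unordered_map
folly::F14FastSet<int> seen_value_ids;          // was std::unordered_set

void record(const std::string& op_kind, int value_id) {
  op_counts[op_kind] += 1;        // same operator[], find(), count(), ...
  seen_value_ids.insert(value_id);
}
```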

Reviewed By: d1jang

Differential Revision: D30566149

fbshipit-source-id: 20a7fa2519e4dde96fb3fc61ef6c92bf6d759383
2021-08-27 01:40:41 -07:00
Mike Iovine
7774a4e95b [Static Runtime] Implement prim::VarStack out variant (#63579)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63579

Provide a static runtime out variant implementation for the new op introduced in D30426232 (1385f9fb12).

Test Plan: `buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- IndividualOps_VarStack`

Reviewed By: navahgar

Differential Revision: D30410525

fbshipit-source-id: bc59a3d8ad23e3d94561ec2dca9cc20687dbadf8
2021-08-24 09:44:29 -07:00
Mike Iovine
d96ef8c1b1 [Static Runtime] SR clones graph input (#63704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63704

Previously SR did not clone the graph. This was leading to subtle bugs in `testStaticRuntime`; static runtime would modify its graph, and the graph used by the JIT interpreter would change as well. The JIT interpreter would then crash if SR-only ops were added!

Cloning the graph is more consistent with the behavior of the `Module` ctor.
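
A sketch of the fix (`Graph::copy()` performs a deep clone):

```cpp
#include <torch/csrc/jit/ir/ir.h>

// Take ownership of a private copy so SR passes (e.g. swapping in SR-only
// ops) cannot mutate the graph the JIT interpreter still holds.
std::shared_ptr<torch::jit::Graph> own_graph(
    const std::shared_ptr<torch::jit::Graph>& g) {
  return g->copy();  // instead of storing `g` itself
}
```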

Test Plan: `buck test caffe2/benchmarks/static_runtime/...`

Reviewed By: hlu1

Differential Revision: D30463294

fbshipit-source-id: b771551a1f55f95fde79373b23babcf3e5ddf726
2021-08-23 18:45:41 -07:00
Mike Iovine
fc6dd0bc00 [JIT] Move UseVariadicCat internals (#63577)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63577

Since other variadic ops will have an almost identical implementation, we can generalize the `UseVariadicCat` implementation and put it in a common folder.

Also moved some test utilities that other variadic op tests will likely need.

Test Plan: `buck test caffe2/test/cpp/jit:jit -- ConcatOptTest`

Reviewed By: navahgar

Differential Revision: D30409937

fbshipit-source-id: 925c11c27b58ce98cb8368d2a205e26ba66d3db9
2021-08-23 17:30:36 -07:00
Mike Iovine
779a3d47b0 [Static Runtime] Benchmark reports native nodes (#63346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63346

We have seen that we can get significant perf wins essentially for free by implementing native ops for ops that we cannot write out variants for (e.g. TupleUnpack D30306955 (078b8004a6), append D30326461 (9d9e7a8d72)). Therefore, whether or not SR is using a native implementation is valuable information. By capturing this in the benchmarking suite, we can hopefully avoid wasting time profiling/manually inspecting `native_ops.cpp`.

Reviewed By: hlu1

Differential Revision: D30346752

fbshipit-source-id: 205b090513b6a5a6ce4cb92f75ab0395b15d08f9
2021-08-18 15:05:08 -07:00
Don Jang
075024b9a3 [Static Runtime] Fix a bug that assigns multiple outputs to single storage (#63012)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63012

This change fixes a bug that the static runtime's memory optimizer assigns multiple outputs of a node to the same storage.  Fixing this bug enables the static runtime to run `inline_cvr` with its memory optimizer enabled.

A problematic line from `inline_cvr` was as follows:
```
  %7767 : Tensor, %getitem_6419.1 : Tensor = fb::gather_ranges(%tensor74.1, %7764)
```
where enabling the memory optimizer assigns `%7767` and `%getitem_6419.1` to the same storage, which corrupted their data during the 2nd iteration.

This change fixed the aforementioned bug by marking all inputs & outputs of a node as `alive` during our liveness analysis. By doing that, no inputs/outputs will collide with each other. I believe this is a fair assumption that most ops' implementations already satisfy, but it was missing from our analysis before this change.
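
In sketch form (illustrative types, not SR's liveness code), the rule is that everything a node touches is live at once while that node runs:

```cpp
#include <set>
#include <vector>

using ValueId = int;

struct NodeSketch {
  std::vector<ValueId> inputs;
  std::vector<ValueId> outputs;
};

// While `node` executes it reads its inputs and writes its outputs, so all
// of them must be treated as simultaneously alive; none may share storage.
void mark_alive_during(const NodeSketch& node, std::set<ValueId>& alive) {
  alive.insert(node.inputs.begin(), node.inputs.end());
  alive.insert(node.outputs.begin(), node.outputs.end());
}
```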

Test Plan: - Added a unittest `StaticRuntime.ValuesShareSameStorageDoesNotContainOutputsFromSameNode` to cover the new code.

Reviewed By: hlu1

Differential Revision: D30202018

fbshipit-source-id: 10287a1bee9e86be16a5201e9a7cd7c7f046bab9
2021-08-16 16:52:02 -07:00
Hao Lu
aa63c0d9df [PyPer] Skip printing out per node time when do_profile is on (#63256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63256

This suppresses printing out the per-node time, which is very long when the net has many ops. It can easily be turned back on by setting `--pt_sr_print_per_node_time=1`.
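
A sketch of the gating (assuming gflags-style macros, as elsewhere in this codebase; the surrounding function is hypothetical):

```cpp
#include <gflags/gflags.h>

DEFINE_bool(
    pt_sr_print_per_node_time,
    false,
    "Print per-node latency in the Static Runtime benchmark output");

void maybe_print_per_node_times(/* per-node timing data */) {
  if (!FLAGS_pt_sr_print_per_node_time) {
    return;  // skip the long "Node #k: ..." dump unless explicitly requested
  }
  // ... print one "Node #k: X ms/iter, <node IR>" line per node ...
}
```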

Reviewed By: ajyu, mikeiovine

Differential Revision: D30298331

fbshipit-source-id: 32b3f93b3fe19d335654168311fda93331a1e706
2021-08-16 16:32:19 -07:00
Richard Barnes
8720369a48 irange-ify 12b (#62484)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/62484

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D30015528

fbshipit-source-id: c4e1a5425a73f100102a97dcec1579f1049c9c1d
2021-08-09 16:40:47 -07:00
Nikita Shulga
30214aef2d [BE] irangefy (#62928)
Summary:
Replace for loops with `irange` loops. Also fix some unused-variable warnings in range-loop cases.
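
Before/after sketch of the conversion (`c10::irange(n)` yields 0, 1, ..., n-1 with the index type deduced from `n`):

```cpp
#include <c10/util/irange.h>
#include <vector>

int64_t total_size(const std::vector<std::vector<int>>& vs) {
  int64_t total = 0;
  // Before: for (size_t i = 0; i < vs.size(); ++i) { ... }
  for (const auto i : c10::irange(vs.size())) {
    total += static_cast<int64_t>(vs[i].size());
  }
  return total;
}
```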

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62928

Reviewed By: driazati

Differential Revision: D30171904

Pulled By: malfet

fbshipit-source-id: 1b437a0f7e3515f4a2e324f3450e93312f1933ae
2021-08-07 13:34:13 -07:00
Hao Lu
cf3cc01f1d [Static Runtime] Add is_frozen to StaticModule ctor (#62020)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62020

Add is_frozen to StaticModule ctor so we can skip freezing in StaticModule.

Reviewed By: ajyu, mikeiovine

Differential Revision: D29807431

fbshipit-source-id: 7742e9f5c5ae9f442a9e4007c870a14fd8b4af20
2021-07-23 15:12:35 -07:00
Raghavan Raman
ae58a4c45d [Static Runtime] Added a variadic cat operator (#61302)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61302

Test Plan: Imported from OSS

Reviewed By: hlu1

Differential Revision: D29565344

Pulled By: navahgar

fbshipit-source-id: 96f5f4546ec0e61eb7f87e016e026e7b62576248
2021-07-21 15:58:20 -07:00
Hao Lu
a07b08136f [Static Runtime] Check unsupported op when enabling static runtime (#61613)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61613

Reviewed By: ajyu, movefast1990

Differential Revision: D29663466

fbshipit-source-id: d819903b7227f534c0a4fffa5eeea2b5c0c04750
2021-07-14 02:13:51 -07:00
Don Jang
8a2c7d902f [static runtime] Add DCHECK to ensure that outputs do not overlap with immutable inputs (#61301)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61301

This change adds a `DCHECK` to ensure that outputs do not overlap with immutable inputs.

Test Plan:
Added unittests as follows:

- `ProcessedNode.VerifyOutputsNotOverlappingWithImmutableInputsWithImmutableArguments`
- `ProcessedNode.VerifyOutputsNotOverlappingWithImmutableInputsWithMutableArguments`

Reviewed By: hlu1

Differential Revision: D29564158

fbshipit-source-id: bf14b4978ab544af79010cf724ed28202b4521cc
2021-07-12 18:04:05 -07:00
Ansha Yu
5a20c56ebc [static runtime] Remove hasOperation() check (#61496)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61496

glow::FusionGroup is a JitOnlyOperator that produces an Operation when passed a Node*: https://fburl.com/ybwfn3bl

hasOperation() doesn't return true in that case: https://fburl.com/19wd10aw

By removing the hasOperation() check, the Operation gets materialized successfully, and static runtime enables and runs OK. Will check that the outputs match the JIT interpreter.

Test Plan:
Test with 281805158_2
```
./buck-out/gen/admarket/lib/ranking/prediction_replayer/replayer --model_inference_type_target=DISAGG_ACCELERATOR --prediction_replayer_force_model_type=inline_cvr_post_imp_model --prediction_replayer_force_model=281805158_2 --prediction_replayer_target_tier=127.0.0.1:7447 --prediction_replayer_input_stream_filename=/data/users/ansha/tmp/adfinder/filter_requests_inline_cvr_post_imp_model_1000_2021_04_29 --ignore_model_id_mismatch --check_performance --fully_remote_sr_connection_options="overall_timeout:10000000,processing_timeout:10000000" --use_new_encoding_for_ads_services --use_new_encoding_from_model_id_to_shard_id --sigrid_force_model_dir=/data/users/ansha/tmp/adfinder/281805158_2/ --sigrid_predictor_model_suffix=.predictor.disagg.local --use_new_encoding_from_model_id_to_shard_id=true --prediction_replayer_force_model_kind=19 --pytorch_predictor_static_runtime_enable=true --prediction_replayer_target_qps=1
```

```
NNPI_LOG_LEVEL=0 USE_INF_API=1 ./buck-out/gen/sigrid/predictor/sigrid_remote_predictor_glow_nnpi \
  --force_models=281805158_2 \
  --sigrid_predictor_model_suffix=.predictor.disagg.remote_other \
  --gflags_config_path=sigrid/predictor/gflags/predictor_gflags_ads_perf_glow_nnpi_pyper_v1 \
  --smc_server_port=7447 \
  --sigrid_predictor_tier_name=sigrid.predictor.perf.dianshi_staticruntime_debug_0604.test.storage \
  --predictor_storage_smc_tier=sigrid.predictor.perf.dianshi_staticruntime_debug_0604.test.storage \
  --predictor_storage_smc_tier_v2=sigrid.predictor.perf.dianshi_staticruntime_debug_0604.test.storage \
  --torch_glow_min_fusion_group_size=30 \
  --glow_enable_sanitize_inputs=100 \
  --sigrid_force_model_dir=/data/users/ansha/tmp/adfinder/281805158_2/ \
  --pytorch_predictor_static_runtime_enable=true \
  --pytorch_predictor_glow_enable=true \
  --pytorch_predictor_enable_loading_xl_format_on_cpu=false \
  --pytorch_disagg_acc_input_dump_path=/tmp/
```

Reviewed By: hlu1

Differential Revision: D29647043

fbshipit-source-id: 8ce6dc0f4f0464b65ca6a8c9d42e3d8bb392e66e
2021-07-12 10:09:33 -07:00
Hao Lu
7d7b7abb3b [Static Runtime] Separate function for getting always_alive values (#61506)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61506

Separate out the logic of GetAlwaysAliveValues from GetLivenessMap to simplify the code structure. Also, GetLivenessMap no longer needs to run if optimize_memory is turned off.

Reviewed By: ajyu

Differential Revision: D29423534

fbshipit-source-id: dbdeeb10f7bcad86a24aa12f741f7c9ab946bb3b
2021-07-10 16:59:29 -07:00
Hao Lu
ccd0977060 [Static Runtime] Support prim::GetAttr/SetAttr (#61505)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61505

The handling of `self` in static runtime was previously incorrect. This diff fixes that issue, since `self` is essential to prim::GetAttr/SetAttr. After all, most of the time we're getting and setting attributes on `self`, the TorchScript module.
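
What native GetAttr/SetAttr boil down to once `self` is threaded through correctly, as a sketch (illustrative free functions, not SR's registration code):

```cpp
#include <ATen/core/ivalue.h>
#include <string>
#include <utility>

// prim::GetAttr: read a named attribute off the module object.
c10::IValue get_attr(const c10::IValue& self, const std::string& name) {
  return self.toObjectRef().getAttr(name);
}

// prim::SetAttr: write a named attribute on the module object.
void set_attr(const c10::IValue& self, const std::string& name, c10::IValue v) {
  self.toObjectRef().setAttr(name, std::move(v));
}
```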

Reviewed By: ajyu

Differential Revision: D29350173

fbshipit-source-id: 6e62add4cda517ef8cd6c315d4cb0595e7d531fb
2021-07-10 14:06:06 -07:00
Mike Guo
6ecc1a4c4f Make pytorch clang-tidy clean (#60649)
Summary:
This PR suppresses clang-tidy warnings in the codebase (for now) so that we can re-enable clang-tidy checks on master.

I ran this script to add the `NOLINTNEXTLINE` comments (on a devserver):
```bash
python3 setup.py develop

# Uses same script that's run on CI and adds the -j (parallel), -s (add comments), -k (continue if diagnostic errors are found) options
python3 tools/clang_tidy.py \
  -j \
  -s \
  -k \
  -v \
  --paths torch/csrc/ \
  -g"-torch/csrc/jit/passes/onnx/helper.cpp" \
  -g"-torch/csrc/jit/passes/onnx/shape_type_inference.cpp" \
  -g"-torch/csrc/jit/serialization/onnx.cpp" \
  -g"-torch/csrc/jit/serialization/export.cpp" \
  -g"-torch/csrc/jit/serialization/import.cpp" \
  -g"-torch/csrc/jit/serialization/import_legacy.cpp" \
  -g"-torch/csrc/onnx/init.cpp" \
  -g"-torch/csrc/cuda/nccl.*" \
  -g"-torch/csrc/cuda/python_nccl.cpp" \
  -g"-torch/csrc/autograd/FunctionsManual.cpp" \
  -g"-torch/csrc/generic/*.cpp" \
  -g"-torch/csrc/jit/codegen/cuda/runtime/*" \
  -g"-torch/csrc/deploy/interpreter/interpreter.cpp" \
  -g"-torch/csrc/deploy/interpreter/interpreter.h" \
  -g"-torch/csrc/deploy/interpreter/interpreter_impl.h" \
  -g"-torch/csrc/deploy/interpreter/test_main.cpp"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/60649

Test Plan: Verified changes by re-running the script (without the `-s` option) and seeing no warnings/errors.

Reviewed By: walterddr, janeyx99

Differential Revision: D29504258

Pulled By: 1ntEgr8

fbshipit-source-id: 78310b30ee8213b73ddb4771ad874665323e7a4e
2021-07-01 12:21:07 -07:00
Hao Lu
e3abccec8a [Static Runtime] Remove output type constraints (#60669)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60669

Test Plan: Added unit test to check for nested outputs.

Reviewed By: ajyu

Differential Revision: D29322025

fbshipit-source-id: a3c8d3c5f0bb7cf7fda4bc5f579adb8fa7bc3724
2021-06-26 02:36:27 -07:00
Edvard Ghazaryan
f240624080 displays graph node's info (#59679)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59679

Displays info about graph's nodes

Test Plan:
Expected view:

```
%wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
	i0: Tensor CPUFloatType {32, 50}
	i1: Tensor CPUFloatType {1, 50}
	i2: int {1}
	o0: Tensor CPUFloatType {32, 50}
%wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
	i0: Tensor CPUFloatType {32, 50}
	i1: Tensor CPUFloatType {1, 50}
	o0: Tensor CPUFloatType {32, 50}
%wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
	i0: Tensor CPUFloatType {32, 50}
	i1: double {0}
	i2: double {10}
	o0: Tensor CPUFloatType {32, 50}
%user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
	i0: Tensor CPUFloatType {32, 1, 32}
	i1: int {1}
	i2: int {2}
	o0: Tensor CPUFloatType {32, 32, 1}
%dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
	i0: Tensor CPUFloatType {32, 1, 32}
	i1: Tensor CPUFloatType {32, 32, 1}
	o0: Tensor CPUFloatType {32, 1, 1}
%31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
	i0: Tensor CPUFloatType {32, 1, 1}
	i1: int {1}
	i2: int {-1}
	o0: Tensor CPUFloatType {32, 1}
%19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
	i0: Tensor CPUFloatType {32, 1}
	i1: Tensor CPUFloatType {32, 50}
	o0: TensorList {2}
%input.1 : Tensor = aten::cat(%19, %4)
	i0: TensorList {2}
	i1: int {1}
	o0: Tensor CPUFloatType {32, 51}
%fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
	i0: Tensor CPUFloatType {1}
	i1: Tensor CPUFloatType {32, 51}
	i2: Tensor CPUFloatType {51, 1}
	i3: int {1}
	i4: int {1}
	o0: Tensor CPUFloatType {32, 1}
%23 : Tensor = aten::sigmoid(%fc1.1)
	i0: Tensor CPUFloatType {32, 1}
	o0: Tensor CPUFloatType {32, 1}
%24 : (Tensor) = prim::TupleConstruct(%23)
	i0: Tensor CPUFloatType {32, 1}
	o0: Tuple {1}
```

Reviewed By: hlu1

Differential Revision: D28592852

fbshipit-source-id: 09174014f7d0ce25c511025d2b376f14e16c8a4a
2021-06-10 10:33:30 -07:00
Richard Barnes
fbe65b16ae Use irange in torch/csrc/jit (#55716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55716

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27690245

fbshipit-source-id: 6052b0acd792a9527d131822453a17cdb7ae3ba5
2021-06-07 16:48:08 -07:00
Richard Barnes
3979cb0656 irange for size_t (#55320)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55320

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27572577

fbshipit-source-id: 97710fd2bb1303006b05828a0d1343b0b59ccb03
2021-06-03 01:04:13 -07:00
Hao Lu
c00eefb6c7 [Static Runtime] Clean up and fix bugs in Static Runtime (#58829)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58829

- Delete copying and moving of MemoryPlanner.
- Remove `inline` in some of the member functions because member functions implemented in classes are inline by default.
- Clean up and update comments.
- Reorganize some code.

Reviewed By: edvgha

Differential Revision: D28555476

fbshipit-source-id: 7ea8efc0e2ed93a6788a742470b9e753a85df677
2021-05-24 19:46:58 -07:00
Edvard Ghazaryan
a7f06e1e55 Added statistics related to out-variant nodes
Summary: Added more statistics info for static runtime.

Test Plan:
caffe2/benchmarks/static_runtime:static_runtime_cpptest

Expected output example:

```
Static runtime ms per iter: 0.939483. Iters per second: 1064.41
Node #0: 0.195671 ms/iter, %wide_offset.1 : Tensor = aten::add(%wide.1, %self._mu, %4)
Node #1: 0.169457 ms/iter, %wide_normalized.1 : Tensor = aten::mul(%wide_offset.1, %self._sigma)
Node #2: 0.118218 ms/iter, %wide_preproc.1 : Tensor = aten::clamp(%wide_normalized.1, %5, %6)
Node #3: 0.038814 ms/iter, %user_emb_t.1 : Tensor = aten::transpose(%user_emb.1, %4, %7)
Node #4: 0.0860747 ms/iter, %dp_unflatten.1 : Tensor = aten::bmm(%ad_emb_packed.1, %user_emb_t.1)
Node #5: 0.0102666 ms/iter, %31 : Tensor = static_runtime::flatten_copy(%dp_unflatten.1, %4, %8)
Node #6: 0.000476333 ms/iter, %19 : Tensor[] = prim::ListConstruct(%31, %wide_preproc.1)
Node #7: 0.0707332 ms/iter, %input.1 : Tensor = aten::cat(%19, %4)
Node #8: 0.123695 ms/iter, %fc1.1 : Tensor = aten::addmm(%self._fc_b, %input.1, %29, %4, %4)
Node #9: 0.0309244 ms/iter, %23 : Tensor = aten::sigmoid(%fc1.1)
Node #10: 0.0046297 ms/iter, %24 : (Tensor) = prim::TupleConstruct(%23)
Time per node type:
       0.195671 ms.    23.0483%. aten::add (1 nodes)
       0.169457 ms.    19.9605%. aten::mul (1 nodes, out variant)
       0.123695 ms.    14.5702%. aten::addmm (1 nodes, out variant)
       0.118218 ms.     13.925%. aten::clamp (1 nodes, out variant)
      0.0860747 ms.    10.1388%. aten::bmm (1 nodes, out variant)
      0.0707332 ms.    8.33175%. aten::cat (1 nodes, out variant)
       0.038814 ms.    4.57195%. aten::transpose (1 nodes)
      0.0309244 ms.    3.64263%. aten::sigmoid (1 nodes, out variant)
      0.0102666 ms.    1.20932%. static_runtime::flatten_copy (1 nodes, out variant)
      0.0046297 ms.   0.545338%. prim::TupleConstruct (1 nodes, out variant)
    0.000476333 ms.  0.0561079%. prim::ListConstruct (1 nodes, out variant)
       0.848959 ms. in Total
StaticRuntime setup time: 0.018925 ms
Memory allocation time: 0.019808 ms
Memory deallocation time: 0.0120445 ms
Outputs deallocation time: 0.0864947 ms
Total memory managed: 19328 bytes
Total number of reused tensors: 3
Total number of 'out' variant nodes/total number of nodes: 9/11 (81.8182%)
```

Reviewed By: hlu1

Differential Revision: D28553029

fbshipit-source-id: 55e7eab50b4b475ae219896100bdf4f6678875a4
2021-05-20 13:57:07 -07:00
Ansha Yu
eb1ffa91d8 [pyper] allow static runtime on and glow on simultaneously (#57972)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57972

Allow static runtime to be on when glow is on. This should be fine as long as glow AOT has already been run.

Test Plan: Tested on the replayer with the remote_other net. D28291326 fixes the remaining issue of removing loops from the remote_other model. Need to test on the regenerated model.

Reviewed By: hlu1

Differential Revision: D28275514

fbshipit-source-id: ee78972660dfdc3fcfb9af2cf7ebb19ee745a4f1
2021-05-11 12:24:07 -07:00
CodemodService FBSourceClangFormatLinterBot
cbfce376a8 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D28319469

fbshipit-source-id: 8295597a8ee16b2fef3f7aacdd6c892cb22db988
2021-05-10 03:39:31 -07:00
Nikita Shulga
3a66a1cb99 [clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841)
Summary:
Add cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy
Remove existing nolint warnings using following script:
```
for file in `git ls-files | grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i  $file; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841

Reviewed By: samestep

Differential Revision: D28295045

Pulled By: malfet

fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163
2021-05-07 20:02:33 -07:00
Hao Lu
5439977352 [Static Runtime] Revamp op schema check (#57521)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57521

When an op is added to static runtime, we manually check the schema (not with the JIT schema check, but with IValue.isTensor()/isInt() etc.) and make sure it's one we support. If the schema doesn't match, SR would throw an exception with TORCH_CHECK, which makes the entire graph invalid for SR.

This diff makes ops with unsupported schemas use the fallback path and go through the dispatcher instead:

```
  if (node->kind() != prim::ListConstruct &&
      node->kind() != prim::TupleConstruct &&
      node->kind() != prim::DictConstruct && node->kind() != prim::ListUnpack) {
    const Operator& op = node->getOperator();
    TORCH_CHECK(op.hasOperation());
    op_ = op.getOperation(node);
    VLOG(1) << "Fallback interpreter for node: " << PrintNode(node);
  }
```

The 2-arg `torch.norm`, which the SR `torch.norm` impl doesn't support (only the 3-, 4-, and 5-arg variants are supported), can now run in static runtime in fallback mode.

(Note: this ignores all push blocking failures!)

Reviewed By: ajyu

Differential Revision: D27531447

fbshipit-source-id: 0a9c2662ac73ed0393a23cc3a2c7df45fdb00fdd
2021-05-04 02:48:04 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Edvard Ghazaryan
a09bbe73fd static runtime support for fb::equally_split (#56812)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56812

fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split. So fb::equally_split will have as many outputs as ListUnpack.

Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op

Reviewed By: hlu1

Differential Revision: D27974999

fbshipit-source-id: b2ca19ff86aec76b977c1e3cfc56567adab66b35
2021-04-26 20:18:09 -07:00
Hao Lu
e4efc0c948 [Static Runtime] Enable check_for_memory_leak in StaticRuntime::benchmark (#56839)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56839

Enable check_for_memory_leak at the end of StaticRuntime::benchmark so this code is exercised more often.

Test Plan: Checked with adindexer merge net model

Reviewed By: edvgha

Differential Revision: D27417911

fbshipit-source-id: 5248942dc439fcc7301ffb0005da76374939fa96
2021-04-23 19:54:58 -07:00
Xiaodong Wang
ed0a0c3578 Revert D27902824: static runtime support for fb::equally_split
Test Plan: revert-hammer

Differential Revision: D27902824 (a4e47ea152)

Original commit changeset: 7855047c3bd4

fbshipit-source-id: a46834418ce98826871cd604d1a01f0ff8f23d7f
2021-04-23 10:03:12 -07:00
Edvard Ghazaryan
a4e47ea152 static runtime support for fb::equally_split (#56565)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56565

fb::equally_split gets fused with ListUnpack, and all outputs from ListUnpack get attached to fb::equally_split. So fb::equally_split will have as many outputs as ListUnpack.

Test Plan:
buck test caffe2/torch/fb/sparsenn:fb_operators_test

buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op

Reviewed By: hlu1

Differential Revision: D27902824

fbshipit-source-id: 7855047c3bd46bbb74b7346ac384c70b6a3e1f46
2021-04-23 00:12:54 -07:00
Hao Lu
33f206b865 [StaticRuntime] Replace StorageImpl with TensorImpl in MemoryPlanner (#56447)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56447

MemoryPlanner shouldn't manage StorageImpls; instead, it should manage the TensorImpls because the StorageImpl in Tensors can change.
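
A standalone illustration (not SR code) of why caching StorageImpls is fragile: ops like `set_` can swap a tensor's storage, so a saved StorageImpl pointer goes stale while the Tensor handle stays valid.

```cpp
#include <ATen/ATen.h>
#include <iostream>

int main() {
  at::Tensor t = at::empty({2});
  const void* before = t.storage().unsafeGetStorageImpl();
  t.set_(at::empty({4}));  // swaps in the other tensor's StorageImpl
  const void* after = t.storage().unsafeGetStorageImpl();
  std::cout << "storage changed: " << (before != after) << '\n';  // prints 1
  return 0;
}
```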

Test Plan: CI

Reviewed By: ajyu

Differential Revision: D27840361

fbshipit-source-id: f22165d167c70165be2934c6717b5057a8bb4d29
2021-04-20 23:04:01 -07:00
Peng Wu
1a116a9332 [Static runtime] Add optimize_graph_output_memory flag (#55811)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55811

- Added manage_graph_output_memory flag to opts (default false)
- Added checks for the flag dependencies between enable_out_variant, optimize_graph_output_memory, and optimize_memory
- Minor refactoring for readability

Test Plan: buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- --exact 'caffe2/caffe2/fb/predictor:pytorch_predictor_test - PyTorchPredictor.StaticRuntime'

Reviewed By: hlu1

Differential Revision: D27573780

fbshipit-source-id: 28698657f686f27b8ad60e1276cdf17402d2cf91
2021-04-14 15:41:18 -07:00
Peng Wu
18662d4321 [Static runtime] refactor MemoryPlanner codes to prepare for output tensor memory planning (#55809)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55809

[Static runtime] refactor MemoryPlanner codes to prepare for output tensor memory planning

Test Plan: buck test mode/dev //caffe2/caffe2/fb/predictor:pytorch_predictor_test -- --exact 'caffe2/caffe2/fb/predictor:pytorch_predictor_test - PyTorchPredictor.StaticRuntime'

Reviewed By: bwasti

Differential Revision: D27411416

fbshipit-source-id: 7dae7c2586ce3b4ebacf6169017140166c30e99c
2021-04-13 11:04:47 -07:00
Ailing Zhang
c6d9ca0c2b [reland]Replace AutoNonVariableTypeMode with InferenceMode in static runtime. (#55731)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55731

Forgot to export the diff in my last one. Retry...

Test Plan:
https://www.internalfb.com/intern/aibench/details/3752129704
https://www.internalfb.com/intern/aibench/details/1306815519

Reviewed By: hlu1

Differential Revision: D27694660

fbshipit-source-id: b351338fa789b9e9c7337df9b1bc1bc0fc387f5d
2021-04-12 09:48:20 -07:00
Ailing Zhang
5a8cdc2fdb Revert D27691509: Replace AutoNonVariableTypeMode with InferenceMode in static runtime.
Test Plan: revert-hammer

Differential Revision: D27691509 (d695ba94f6)

Original commit changeset: d43db028a399

fbshipit-source-id: 8cfa2f821ef3251b323483691672ed70858d9d68
2021-04-09 20:36:20 -07:00
Ailing Zhang
d695ba94f6 Replace AutoNonVariableTypeMode with InferenceMode in static runtime.
Test Plan:
https://www.internalfb.com/intern/aibench/details/3752129704
https://www.internalfb.com/intern/aibench/details/1306815519

Reviewed By: hlu1

Differential Revision: D27691509

fbshipit-source-id: d43db028a399bb02166a539577f6922237145f83
2021-04-09 20:04:00 -07:00
Mike Ruberry
c0ac0fef4e Revert D27448156: irange for size_t
Test Plan: revert-hammer

Differential Revision: D27448156 (041b4431b2)

Original commit changeset: 585da57d4de9

fbshipit-source-id: 8e047c29f391c0166e0a1a87c3fb2a0854377365
2021-04-03 19:14:00 -07:00
Richard Barnes
041b4431b2 irange for size_t (#55163)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/55163

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D27448156

fbshipit-source-id: 585da57d4de91c692b6360d65f7b8a66deb0f8c1
2021-04-02 23:22:29 -07:00
Peng Wu
fe2c1268b7 More name refactoring of memory planning codes to make it more readable (#54272)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/54272

Test Plan: Imported from OSS

Reviewed By: bwasti

Differential Revision: D27233881

fbshipit-source-id: f257f16ac0684df055961e539f17d002cb8f1bfe
2021-03-24 19:52:35 -07:00
Ansha Yu
afe339d7dd [static runtime] support DictConstruct (#54438)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54438

The August 1x model has DictConstruct in the graph (P331168321).
These can easily be removed with a JIT pass, but to measure the improvement
and run the replayer with the model in the meantime, enable DictConstruct in static runtime.
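
A sketch of what a native DictConstruct boils down to (illustrative helper; the actual implementation registers against the node and reads the dict type from the IR):

```cpp
#include <ATen/core/Dict.h>
#include <ATen/core/ivalue.h>
#include <ATen/core/jit_type.h>
#include <vector>

// Consume inputs as interleaved key/value pairs and build a generic dict.
c10::IValue dict_construct(const std::vector<c10::IValue>& inputs) {
  auto dict = c10::impl::GenericDict(c10::AnyType::get(), c10::AnyType::get());
  dict.reserve(inputs.size() / 2);
  for (size_t i = 0; i + 1 < inputs.size(); i += 2) {
    dict.insert_or_assign(inputs[i], inputs[i + 1]);  // key, value
  }
  return dict;
}
```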

Test Plan:
```
./sigrid/predictor/scripts/pytorch/pyper_inference_e2e_local_replayer_test.sh \
    cpu 218841466_0 7449 /data/users/ansha/tmp/adfinder/august_1x/ /data/users/ansha/tmp/adfinder/august_1x/filtered_requests_inline_cvr_100
```

```
TEST trace
Total num requests                                   100
Num exceptions                                         0
Latency us avg                                    180965
Latency us p25                                     89785
Latency us p50                                    131240
Latency us p75                                    146621
Latency us p90                                    158378
Latency us p95                                    166628
Latency us p99                                   1886680
Latency us p100                                  3803252
Server latency us avg                              91554
Server latency us p25                              51447
Server latency us p50                              86371
Server latency us p75                              95229
Server latency us p90                             102706
Server latency us p95                             116023
Server latency us p99                             557017
Server latency us p100                            716319
Num rankUnits avg                                     28
```

Reviewed By: hlu1

Differential Revision: D27236682

fbshipit-source-id: 1da49a836dd7533480e77797338baa9edcb65fb5
2021-03-23 21:20:03 -07:00