Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70248
Modified loops in files under fbsource/fbcode/caffe2/ from the format
```
for(TYPE var=x0;var<x_max;var++)
```
to the format
```
for(const auto var: irange(x_max))
```
This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, and a number of reversions and unused-variable suppression warnings added by hand.
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision: D32813863
fbshipit-source-id: 527244b4a2b220fdfe7f17dee3599603f492a2ca
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68485
In OSS, the only change is that we make the predict_net field of PredictorExporterMeta nullable.
Test Plan: sandcastle, let CI run
Reviewed By: boryiingsu
Differential Revision: D32467138
fbshipit-source-id: 81bd5fca695462f6a186bcfa927073874cc9c26a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66743
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for(TYPE var=x0;var<x_max;var++)`
to the format
`for(const auto var: irange(x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, and a number of reversions and unused-variable suppression warnings added by hand.
Test Plan: Sandcastle
Reviewed By: malfet
Differential Revision: D31705359
fbshipit-source-id: c9ea2fbc0f9cd29e97a52dcb203addc5f2abb09b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234
Modified loops in files under fbsource/fbcode/caffe2/ from the format
`for(TYPE var=x0;var<x_max;var++)`
to the format
`for(const auto var: irange(x_max))`
This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit, and a number of reversions and unused-variable suppression warnings added by hand.
bypass_size_limit
allow-large-files
Test Plan: Sandcastle
Reviewed By: ngimel
Differential Revision: D30652629
fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66443
For some reason, this logging is adding noise to a lot of flow jobs. I am not sure if this is actually needed.
This is called from the __init__, so it's logged all the time and logs all key:value pairs in the current local symbol table.
Test Plan: N/A
Reviewed By: chowarfb
Differential Revision: D31534372
fbshipit-source-id: bed032b66fed548c97a6f66b1b9e905fd2738851
Summary:
Addresses this network risk mitigation mentioned in https://github.com/pytorch/pytorch/issues/65439#issuecomment-924627239.
I didn't include any mobile app/benchmarking changes because I think the pretrained weights matter there.
I ended up removing the changes in test_utils because those were sensitive to the pretrained variable.
I am saving the quantization test changes for another PR because they are currently disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66312
Reviewed By: ejguan
Differential Revision: D31542992
Pulled By: janeyx99
fbshipit-source-id: 57b4f70247af25cc96c57abd9e689c34641672ff
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66056
We keep running into this unrelated failure when landing diffs for the GPU inference project, so we disable this operator's unit test on GPU, since the operator doesn't exist there:
```
RuntimeError: [enforce fail at operator.cc:277] op. Cannot create operator of type 'SmartDecaySparseAdam' on the device 'CUDA'. Verify that implementation for the corresponding device exist. It might also happen if the binary is not linked with the operator implementation code. If Python frontend is used it might happen if dyndep.InitOpsLibrary call is missing. Operator def: input: "param" input: "mom1" input: "mom2" input: "last_seen" input: "indices" input: "grad" input: "lr" input: "iter" output: "param" output: "mom1" output: "mom2" output: "last_seen" name: "" type: "SmartDecaySparseAdam" arg { name: "beta1" f: 0 } arg { name: "beta2" f: 0.9 } arg { name: "epsilon" f: 1e-05 } device_option { device_type: 1 }
```
https://www.internalfb.com/intern/testinfra/diagnostics/5910974579962988.562949996565057.1633122845/
Test Plan: sandcastle
Reviewed By: jianyuh
Differential Revision: D31364731
fbshipit-source-id: 7fbd994cbe7f6ca116f5f34506a1ed7f14759bdf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65610
- Replace HIP_PLATFORM_HCC with USE_ROCM
- Don't rely on CUDA_VERSION or HIP_VERSION; use USE_ROCM and ROCM_VERSION instead.
- In the next PR:
  - Remove the mapping from CUDA_VERSION to HIP_VERSION and CUDA to HIP in hipify.
  - HIP_PLATFORM_HCC is deprecated, so add HIP_PLATFORM_AMD to support HIP host code compilation on gcc.
cc jeffdaily sunway513 jithunnair-amd ROCmSupport amathews-amd
Reviewed By: jbschlosser
Differential Revision: D30909053
Pulled By: ezyang
fbshipit-source-id: 224a966ebf1aaec79beccbbd686fdf3d49267e06
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64244
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64040
In operator cost inference functions, in many places we are using sizeof(x.data_type()). Since data_type() returns a 32 bit integer from [this enum](https://www.internalfb.com/code/fbsource/[15e7ffe4073cf08c61077c7c24a4839504b964a2]/fbcode/caffe2/caffe2/proto/caffe2.proto?lines=20), we are basically always getting 4 for sizeof(x.data_type()) no matter what actual data type x has. Big thanks to Jack Langman for specifically pointing to this bug.
We would instead use the size in bytes based on actual data type.
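As a hedged Python illustration of the fix (the actual change is in the C++ cost inference code): the element size must come from the actual data type rather than from sizeof of the 32-bit enum, which is always 4.
```
# Illustrative only: ordinary type widths standing in for the C++ fix;
# these names are not Caffe2 internals.
BYTES_PER_DTYPE = {"UINT8": 1, "FLOAT16": 2, "FLOAT": 4, "DOUBLE": 8, "INT64": 8}

def bytes_moved(num_elements, dtype_name):
    # The buggy version effectively computed num_elements * 4 always.
    return num_elements * BYTES_PER_DTYPE[dtype_name]
```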
Test Plan:
Added unit tests BatchMatMulMemCostTest:
buck test //caffe2/caffe2/fb/fbgemm:batch_matmul_op_test -- BatchMatMulMemCostTest
Extended existing unit test test_columnwise_concat for different data types:
buck test //caffe2/caffe2/python/operator_test:concat_op_cost_test -- test_columnwise_concat
Reviewed By: CrazySherman
Differential Revision: D30656698
fbshipit-source-id: d42c0c9a0c5b0ddc5dba39e4994f1f85a5e618bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64040
In operator cost inference functions, in many places we are using sizeof(x.data_type()). Since data_type() returns a 32 bit integer from [this enum](https://www.internalfb.com/code/fbsource/[15e7ffe4073cf08c61077c7c24a4839504b964a2]/fbcode/caffe2/caffe2/proto/caffe2.proto?lines=20), we are basically always getting 4 for sizeof(x.data_type()) no matter what actual data type x has. Big thanks to Jack Langman for specifically pointing to this bug.
We would instead use the size in bytes based on actual data type.
Test Plan:
Added unit tests BatchMatMulMemCostTest:
buck test //caffe2/caffe2/fb/fbgemm:batch_matmul_op_test -- BatchMatMulMemCostTest
Extended existing unit test test_columnwise_concat for different data types:
buck test //caffe2/caffe2/python/operator_test:concat_op_cost_test -- test_columnwise_concat
Differential Revision: D30561459
fbshipit-source-id: 976fa5167097a35af548498480001aafd7851d93
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63133
SplitOp is costly but missing a cost inference function, which hurts cost-based balancing. Changes are:
(1) Addition of CostInferenceFunction for SplitOp
(2) Small fix in CostInferenceFunction for ConcatOp
Test Plan:
Added unit tests:
buck test //caffe2/caffe2/python/operator_test:split_op_cost_test
buck test //caffe2/caffe2/python/operator_test:concat_op_cost_test
Reviewed By: smacke
Differential Revision: D30247360
fbshipit-source-id: 989e962f3a981acc85b73aac3fb23e603b7d1591
Summary:
Add `-Wno-writable-strings` (which is clang's flavor of `-Wwrite-strings`) to the list of warnings ignored while compiling torch_python.
Avoid unnecessary copies in range loops.
Fix a number of signed-unsigned comparisons.
Found while building locally on M1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62930
Reviewed By: albanD
Differential Revision: D30171981
Pulled By: malfet
fbshipit-source-id: 25bd43dab5675f927ca707e32737ed178b04651e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63068
The caffe2 core.Net constructor can accept a caffe2_pb2.NetDef proto, but it always creates a copy. This is wasteful when we can prove that the proto being passed to it will not be used anywhere else. So we add an "inplace" argument to the `core.Net` constructor that allows clients to give away ownership of the passed proto without copying. We default this argument to `False`, ensuring that behavior does not change unless explicitly requested.
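A minimal usage sketch; only the `inplace` keyword comes from this summary, the rest is illustrative:
```
from caffe2.python import core
from caffe2.proto import caffe2_pb2

proto = caffe2_pb2.NetDef()
proto.name = "example_net"

copied = core.Net(proto)               # default: the proto is deep-copied
owned = core.Net(proto, inplace=True)  # ownership given away, no copy;
                                       # `proto` must not be reused after this
```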
Test Plan: Let CI run.
Differential Revision: D29976510
fbshipit-source-id: 26e13ca76f3431b8ef0de51f08bbf263491d323e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62493
This diff adds a broadcast fastpath for the caffe2 broadcast utility function, which just copies the contents of a smaller tensor into a larger one. We also update the tests to exercise the new functionality.
Test Plan: unit tests + let CI run
Differential Revision: D29938285
fbshipit-source-id: 543ecc548500380e307be91902696033454964a2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62437
In this diff we add a broadcast fastpath for MulGradient and DivGradient ops, whose tests we update to exercise the new functionality.
Test Plan: Added test cases to elementwise ops (which will exercise the new MulGradient / DivGradient broadcast fastpath functionality) that will be run by CI. It's worth noting there's still no code (outside of the new test cases) that takes the new code paths added -- the user must explicitly request allow_broadcast_fastpath=True, and nothing outside of the added tests currently does so.
Differential Revision: D29938273
fbshipit-source-id: 281c1a109e38c25b9bf9ff8d832de60ac3c231a9
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62428
In this diff we add a broadcast fastpath for reduce utility functions. These functions are used by various elementwise ops, whose tests we update to exercise the new functionality.
Test Plan: Added test cases to elementwise ops (which will exercise the new reducer functionality) that will be run by CI. It's worth noting there's still no code (outside of the new test cases) that takes the new code paths added -- the user must explicitly request `allow_broadcast_fastpath=True`, and nothing outside of the added tests currently does so.
Differential Revision: D29938264
fbshipit-source-id: 5d5542bd93afb85fd9f7a4073f766adc07eb3b65
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62059
GetOperatorCost in Workspace exposes flops and bytes_written only. Make an additional piece, bytes_read, available from OperatorSchema::Cost.
Test Plan:
Added the two additional pieces in the unit test testGetOperatorCost in workspace_test
buck test caffe2/caffe2/python:workspace_test -- testGetOperatorCost
buck test //aml/ml_foundation/exp_platform/large_scale_training/distributed_hogwild/auto_device_placement/tests/...
buck test //aiplatform/training/autotuning/tests/...
buck test //aiplatform/training/pipelining/tests/...
buck test //deeplearning/fblsim/tests/...
Flow tests:
ADP Greedy: f288078287
ADP MILP: f288079278
Reviewed By: CrazySherman, xtaofb
Differential Revision: D29860676
fbshipit-source-id: 8b3a9f2bf17c0dae48cfe2800e8821bf441e0b03
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62135
The initial implementation of Adam with Smart Decay had an off-by-one error. This was in the summation of the geometric series used to calculate how much built-up momentum would have been discharged in skipped minibatches.
The unit tests should have caught this, but the testing strategy missed it because k, the number of skipped minibatches, was always either 0 or so high that the impact of the bug was too small (the impact was proportional to 1/k). The testing strategy has also been adjusted to cover this bug.
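As a hedged illustration of the series involved (the exact summation bounds are precisely what the off-by-one concerned): with first-moment decay beta1, the momentum discharged over k skipped minibatches from a built-up moment m is the finite geometric sum

\sum_{j=1}^{k} \beta_1^{j}\, m = \frac{\beta_1 \left(1 - \beta_1^{k}\right)}{1 - \beta_1}\, m

and summing from j = 0, or stopping at k - 1, produces exactly this class of off-by-one.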
Differential Revision: D29889309
fbshipit-source-id: b086c0efed5c27f621061e726533c73658daffc6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62058
This is the second diff in this stack. This diff includes the changes to DPER3; the first diff includes the changes to Caffe2.
We want to decay learning parameters properly. Previously this was not done when a parameter is absent from a minibatch. We fix this by keeping track of missed minibatches and making decay catch up accordingly.
The exponential moving averages (EMAs) for the first and second moments used in Adam are updated only for parameters seen in a minibatch. Properly, for the absent parameters too, 0 should be added to the EMAs and the EMAs should then be decayed by multiplying by beta1 and beta2, respectively.
To avoid the computational overhead of touching every parameter for every minibatch, we:
* keep track of the last time a parameter is seen
* instead of decaying the EMAs by multiplying by beta1 and beta2, we multiply by beta1^k and beta2^k, where k is the number of minibatches since the parameter was last seen.
We hope this will significantly improve the inconsistent learning parameter issue we have seen with Adam.
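A minimal sketch of the catch-up decay in Python, assuming the update form described above; the names are illustrative, not the actual SmartDecay operator internals:
```
def smart_decay_moments(m, v, grad, step, last_seen, beta1=0.9, beta2=0.999):
    k = step - last_seen  # minibatches since this parameter was last seen
    # One multiplication by beta^k stands in for the k per-minibatch
    # decays, so untouched rows never need to be visited.
    m = beta1 ** k * m + (1.0 - beta1) * grad
    v = beta2 ** k * v + (1.0 - beta2) * grad * grad
    return m, v
```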
Differential Revision: D29638897
fbshipit-source-id: 18d8e227d72c2e23010ca81e0f6eeb78872c8d3c
Summary:
The GoogleTest `TEST` macro is non-compliant with clang-tidy's `cppcoreguidelines-avoid-non-const-global-variables` check, as is `DEFINE_DISPATCH`.
All changes but the ones to `.clang-tidy` were generated using the following script:
```
for i in $(find . -type f -iname "*.c*" -or -iname "*.h" \
           | xargs grep cppcoreguidelines-avoid-non-const-global-variables \
           | cut -f1 -d: | sort | uniq); do
  sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" "$i"
done
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008
Reviewed By: driazati, r-barnes
Differential Revision: D29838584
Pulled By: malfet
fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
Summary: A test case that triggers db_options with the save operator is missing.
Test Plan: buck test
Differential Revision: D29642719
fbshipit-source-id: 72b7374d40430398abac26dfe91538550525384d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61548
We want to decay learning parameters properly. Previously this was not done when a parameter is absent from a minibatch. We fix this by keeping track of missed minibatches and making decay catch up accordingly.
The exponential moving averages (EMAs) for the first and second moments used in Adam are updated only for parameters seen in a minibatch. Properly, for the absent parameters too, 0 should be added to the EMAs and the EMAs should then be decayed by multiplying by beta1 and beta2, respectively.
To avoid the computational overhead of touching every parameter for every minibatch, we:
* keep track of the last time a parameter is seen
* instead of decaying the EMAs by multiplying by beta1 and beta2, we multiply by beta1^k and beta2^k, where k is the number of minibatches since the parameter was last seen
* we calculate the amount of momentum that would have been discharged over the missed minibatches and update the weight accordingly.
Differential Revision: D29654246
fbshipit-source-id: 7a6cd7966eb1f31116d99dfce79a78b2d3ee9e3e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61551
We aim to enable the rate limiter in C2 load, with a fixed bandwidth limit.
This diff updates LoadOp to pass down the manifold db options.
Test Plan:
```
buck test mode/opt caffe2/caffe2/python/operator_test:load_save_test
```
Differential Revision: D29639102
fbshipit-source-id: cf69549adadf4c7f12a8a2b7f3ca39092cab4b99
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/61488
We want to decay learning parameters properly. Previously this was not done when a parameter is absent from a minibatch. We fix this by keeping track of missed minibatches and making decay catch up accordingly.
The exponential moving averages (EMAs) for the first and second moments used in Adam are updated only for parameters seen in a minibatch. Properly, for the absent parameters too, 0 should be added to the EMAs and the EMAs should then be decayed by multiplying by beta1 and beta2, respectively.
To avoid the computational overhead of touching every parameter for every minibatch, we:
* keep track of the last time a parameter is seen
* instead of decaying the EMAs by multiplying by beta1 and beta2, we multiply by beta1^k and beta2^k, where k is the number of minibatches since the parameter was last seen.
Differential Revision: D27978269
fbshipit-source-id: e47524101ddfcb281c46c505b9b7a8f0835bc64a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60402
Add float64 data type support for ScatterWeightedSum, for cases where float32's ~7 significant digits of precision are not sufficient.
Test Plan: buck test caffe2/caffe2/python/operator_test:sparse_ops_test -- testScatterWeightedSum
Reviewed By: jianyuh
Differential Revision: D29190324
fbshipit-source-id: 871a60744694e901a2c7685a67350860745d6729
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59775
This operator is similar to `GetAllBlobNames` but also returns the estimated size required to serialize each blob.
One goal of this operator is to allow checkpoint saving logic to estimate the
amount of space/bandwidth required to save a checkpoint when first starting
training, without actually serializing any blobs yet. Currently the
checkpointing logic uses `GetAllBlobNames` to determine the blobs to
checkpoint. It can instead be updated to use `EstimateAllBlobSizes` to also
get an estimate for how much space will be required for the checkpoint.
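A hedged usage sketch: the operator name comes from this summary, while the two-output layout (`names`, `sizes`) is an assumption:
```
from caffe2.python import core, workspace

op = core.CreateOperator("EstimateAllBlobSizes", [], ["names", "sizes"])
workspace.RunOperatorOnce(op)
print(workspace.FetchBlob("names"))  # blob names
print(workspace.FetchBlob("sizes"))  # estimated serialized bytes per blob
```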
ghstack-source-id: 132275153
Test Plan: Included a new unit test.
Reviewed By: mraway
Differential Revision: D29020227
fbshipit-source-id: 811e5d86c4b59183e84e6424c48c97739be09043
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60382
Instead of setting weight_decay `w` uniformly for all ids, for each row `i` in the sparse embedding table the actual weight decay `w_i` becomes `w * freq_i`, where `freq_i = halflife / counter_i` lies in `[log(2), halflife]`. The counter comes from `rowwise_counter`, with the update `counter_i = 1 + exp(-iter_delta * rho) * counter_i`.
Test Plan:
buck test //caffe2/caffe2/python/operator_test:adagrad_test -- test_row_wise_sparse_adagrad
buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_weight_decay
Reviewed By: 0x10cxR1
Differential Revision: D25581030
fbshipit-source-id: 54b3831b20516c76c559b13d8deb809e2ee3b446
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/60106
In Caffe2, some elementwise in-place compatible ops lack coverage for the in-place case. We add tests for a subset of them here and thereby increase coverage.
Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:elementwise_ops_test
```
Let CI run.
Reviewed By: clrfb
Differential Revision: D29143189
fbshipit-source-id: 83138ad8eff8fe95c40aece53714da3577396a23
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57080
ONNX optimizer is removed in ONNX 1.9
This PR removes ONNX optimizer from a C++ code path and uses `try-except` block in Python to keep it compatible with both ONNX-1.8 and 1.9.
Test Plan: Imported from OSS
Reviewed By: heitorschueroff
Differential Revision: D28467330
Pulled By: malfet
fbshipit-source-id: 5e4669dd0537648898e593f9e253da18d6dc7568
Co-authored-by: neginraoof <neginmr@utexas.edu>
Co-authored-by: Nikita Shulga <nshulga@fb.com>
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58062
Make the function templated to make sure BatchSparseToDense supports int32 lengths/indices.
Test Plan:
```buck test //caffe2/caffe2/python/operator_test:batch_sparse_to_dense_op_test
```
Reviewed By: khabinov
Differential Revision: D28271423
fbshipit-source-id: 41b88b7a3663616b533aaf4731ff35cdf6ec4c85
Summary: Relax test deadlines for c2 tests. We run on loaded machines, and timings are unreliable.
Test Plan: Fixes existing tests
Reviewed By: mruberry
Differential Revision: D28690006
fbshipit-source-id: 457707e81a1ec92548c1f23ea7a0022fa0a3bfda
Summary: Tests are frequently failing with "exceeded the deadline of 1000.00ms"; we expect this to happen, so remove the deadline.
Test Plan: N/A: Fix breakages
Reviewed By: robieta
Differential Revision: D28581051
fbshipit-source-id: 4825ada9af151fa5d57c45c549138c15ba613705
Summary: When run on very heavily loaded machines, some of these tests are timing out. It's not an issue with the test, it's an issue with the environment. I've removed the timeout so we at least keep unit test coverage.
Test Plan: N/A: Fix breakages
Reviewed By: ngimel
Differential Revision: D28492334
fbshipit-source-id: aed3ee371763161aab2d356f5623c7df053fda6f
Summary:
This is the only line (not in `third_party`) matching the regex `^#!.*python2`, and [it is not the first line of its file](https://github.com/koalaman/shellcheck/wiki/SC1128), so it has no effect. As a followup to https://github.com/pytorch/pytorch/issues/58275, this PR removes that shebang to reduce confusion, so now all Python shebangs in this repo are `python3`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58409
Reviewed By: walterddr
Differential Revision: D28478469
Pulled By: samestep
fbshipit-source-id: c17684c8651e45d3fc383cbbc04a31192d10f52f
Summary:
Some machines don't have a versionless `python` on their PATH, which breaks these existing shebangs.
I'm assuming that all the existing versionless `python` shebangs are meant to be `python3` and not `python2`; please let me know if my assumption was incorrect for any of these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/58275
Test Plan: CI.
Reviewed By: zhouzhuojie
Differential Revision: D28428143
Pulled By: samestep
fbshipit-source-id: 6562be3d12924db72a92a0207b060ef740f61ebf
Summary: Removed the deadline restriction since the first run can take more than the deadline, while subsequent runs are shorter.
Reviewed By: ngimel
Differential Revision: D28260077
fbshipit-source-id: 8ed2f5c16bc184bf4fae0a59b662fa1da2d4dd0a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57296
Seems many trainers disable print(), so we cannot see the thread dumps from CompleteInTimeOrDie(). So emit them via log.info() as well.
Test Plan: sandcastle
Reviewed By: aalmah
Differential Revision: D28098738
fbshipit-source-id: dfdca8801bacf5c7bccecc2387cb7ef41dadfa46
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname, "-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit", "--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892
Reviewed By: H-Huang
Differential Revision: D27991944
Pulled By: malfet
fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56717
The signal_handler was under the caffe2 namespace but was being used
by PyTorch as well.
I've fixed this by moving it to the c10 namespace, where now both C2 and PyTorch
can use it.
The signal_handler interface in caffe2/utils/signal_handler.h is kept the same
for backward compatibility with C2, but most of the common code is moved to c10.
ghstack-source-id: 127446929
Test Plan: waitforbuildbot
Reviewed By: ezyang
Differential Revision: D27946738
fbshipit-source-id: d6228d1a0108f4c807d405e7a0bb799c5375388f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56813
When the arg `pass_inputs_as_tensor_list` is True, the input tensors are wrapped into a TensorList and passed in as a single param.
Test Plan: buck test //caffe2/caffe2/python:workspace_test -- TestScriptModule
Reviewed By: dzhulgakov
Differential Revision: D27972928
fbshipit-source-id: 5a199649445b0306f3134086c85bd55da45e1a0b
Summary: `networkx` 2.4+ renamed the `node` attribute to `nodes` on graph objects. This caused failures in `caffe2`'s `topological_sort_traversal_longest_path` function, which uses the networkx library for topological sorting.
Differential Revision: D27718857
fbshipit-source-id: 812fbb613946565d089cc84a20f3cdf7df046e19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55003
Using the `caffe2::setPrintStackTracesOnFatalSignal` utility in
distributed tests to set a signal handler that dumps the state of all threads
for all processes when it receives a FATAL signal. This would help in debugging
tests further.
I had to revert all the python faulthandler code since only one signal handler
function is supported, so running python faulthandler with
`setPrintStackTracesOnFatalSignal` doesn't work.
Sample output:
```
SIGSEGV(11), PID: 3492872, Thread 3492872:
[0] ???(0x7fa7b2d1d61b) in libcaffe2_caffe2_caffe2_cpu.so
[1] ???(0x7fa7b2d1d3fb) in libcaffe2_caffe2_caffe2_cpu.so
[2] ???(0x7fa7b2d1d33d) in libcaffe2_caffe2_caffe2_cpu.so
[3] ???(0x7fa7b2d1d167) in libcaffe2_caffe2_caffe2_cpu.so
[4] ???(0x7fa7ce683150) in libpthread.so.0
[5] ???(0x7fa7be2b233c) in libcaffe2__C_impl_cuda.so
[6] ???(0x7fa7be2ce80c) in libcaffe2__C_impl_cuda.so
[7] ???(0x7fa7be2a0512) in libcaffe2__C_impl_cuda.so
[8] torch::distributed::rpc::TensorPipeAgent::send(torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, float, std::unordered_map<signed char, signed char, std::hash<signed char>, std::equal_to<signed char>, std::allocator<std::pair<signed char const, signed char> > > const&)+0x24f(0x7fa7be29f71f) in libcaffe2__C_impl_cuda.so
[9] torch::distributed::autograd::sendMessageWithAutograd(torch::distributed::rpc::RpcAgent&, torch::distributed::rpc::WorkerInfo const&, torch::distributed::rpc::Message&&, bool, float, bool)+0x393(0x7fa7b602b203) in libcaffe2_libtorch.so
[10] torch::distributed::rpc::pyRpcPythonUdf(torch::distributed::rpc::WorkerInfo const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >&, std::vector<at::Tensor, std::allocator<at::Tensor> >&, float, bool)+0x201(0x7fa7bd844971) in libcaffe2__C_impl_cuda.so
```
ghstack-source-id: 125630551
Test Plan: waitforbuildbot
Reviewed By: SciPioneer
Differential Revision: D27419714
fbshipit-source-id: 8aca9a14ef688004053d8798124d9c3a3fbe3489
Summary:
*Context:* https://github.com/pytorch/pytorch/issues/53406 added a lint for trailing whitespace at the ends of lines. However, in order to pass FB-internal lints, that PR also had to normalize the trailing newlines in four of the files it touched. This PR adds an OSS lint to normalize trailing newlines.
The changes to the following files (made in 54847d0adb9be71be4979cead3d9d4c02160e4cd) are the only manually-written parts of this PR:
- `.github/workflows/lint.yml`
- `mypy-strict.ini`
- `tools/README.md`
- `tools/test/test_trailing_newlines.py`
- `tools/trailing_newlines.py`
I would have liked to make this just a shell one-liner like the other three similar lints, but nothing I could find quite fit the bill. Specifically, all the answers I tried from the following Stack Overflow questions were far too slow (at least a minute and a half to run on this entire repository):
- [How to detect file ends in newline?](https://stackoverflow.com/q/38746)
- [How do I find files that do not end with a newline/linefeed?](https://stackoverflow.com/q/4631068)
- [How to list all files in the Git index without newline at end of file](https://stackoverflow.com/q/27624800)
- [Linux - check if there is an empty line at the end of a file [duplicate]](https://stackoverflow.com/q/34943632)
- [git ensure newline at end of each file](https://stackoverflow.com/q/57770972)
To avoid giving false positives during the few days after this PR is merged, we should probably only merge it after https://github.com/pytorch/pytorch/issues/54967.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54737
Test Plan:
Running the shell script from the "Ensure correct trailing newlines" step in the `quick-checks` job of `.github/workflows/lint.yml` should print no output and exit in a fraction of a second with a status of 0. That was not the case prior to this PR, as shown by this failing GHA workflow run on an earlier draft of this PR:
- https://github.com/pytorch/pytorch/runs/2197446987?check_suite_focus=true
In contrast, this run (after correcting the trailing newlines in this PR) succeeded:
- https://github.com/pytorch/pytorch/pull/54737/checks?check_run_id=2197553241
To unit-test `tools/trailing_newlines.py` itself (this is run as part of our "Test tools" GitHub Actions workflow):
```
python tools/test/test_trailing_newlines.py
```
Reviewed By: malfet
Differential Revision: D27409736
Pulled By: samestep
fbshipit-source-id: 46f565227046b39f68349bbd5633105b2d2e9b19
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54042
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53881
1. Fix the position_weighted optimizer: the position-weighted layer uses the default optimizer but is actually gradient_slice, which causes problems if we do not handle it properly in the new optimizer. The solution is to use sparse adagrad when it is gradient_slices.
2. Optimizer implementation of v1 and v2: using 1st momentum with/without bias_correction.
3. Also implemented decoupled weight decay in the new optimizer.
Test Plan:
buck test //caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_2 -- test_mlp_optimization
buck test //caffe2/caffe2/python:optimizer_test -- TestDecayAdagrad
buck test //caffe2/caffe2/python/operator_test:decay_adagrad_test
ctr_mbl_feed work flow: f255731660
oc work flow: f255739503
Reviewed By: 0x10cxR1
Differential Revision: D26839668
fbshipit-source-id: 2b6881c1a88540ef5766be40f5e80001257e2199
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53735
Add an option to BlobSerializationOptions to request that float data be
serialized as bfloat16. This reduces the serialized data size at the expense
of some loss in precision.
ghstack-source-id: 124317910
Test Plan: Included a new unit test.
Reviewed By: mraway
Differential Revision: D26658205
fbshipit-source-id: 74521ed161059066355a3f208488ed01a344dbb5
Summary: Add the ability to reset the optimizer counter.
Test Plan: will wait for integration tests to run on diff.
Differential Revision: D27248286
fbshipit-source-id: a608df1bd61b64eb317c9ffd9cfdd804c5288f6d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54274
Some of the Python tests need to be aware of whether or not FBGEMM is
available, so expose this setting in the pybind extension.
ghstack-source-id: 124317732
Test Plan: Will use this variable in the tests on D26658205.
Reviewed By: mraway
Differential Revision: D27171780
fbshipit-source-id: 4c94144a959bf8bf0e1553b6e029e94a91794e29
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53402
Add an `options` field to the `Save` operator which accepts options for how to
serialize different blobs. At the moment this simply allows controlling the
existing `chunk_size` behavior, but in the future we can add other options,
such as the ability to control compression settings or other serialization
formats.
ghstack-source-id: 123567034
Test Plan:
Added a new test to `load_save_test.py` that passes in options and verifies
that blobs were serialized with the expected number of chunks.
buck test caffe2/caffe2:caffe2_test_cpu \
caffe2/caffe2/core:serialization_test \
caffe2/caffe2/python/operator_test:load_save_test
Reviewed By: mraway
Differential Revision: D26502577
fbshipit-source-id: 6e302e530bb96990517c2e35c505db7f14a56284
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53401
This is a reland of D26641599 (cd9ac54ea7) after rebasing onto D26802576 (f595ba1bae).
Add some small utility functions to read the blob names back from the minidb
file so that we can verify how many chunks were written for each blob.
ghstack-source-id: 123567033
Test Plan: buck test caffe2/caffe2/python/operator_test:load_save_test
Reviewed By: mraway
Differential Revision: D26853942
fbshipit-source-id: 0b45078fdd279f547752c8fdb771e296374a00da
Summary:
Context: https://github.com/pytorch/pytorch/pull/53299#discussion_r587882857
These are the only hand-written parts of this diff:
- the addition to `.github/workflows/lint.yml`
- the file endings changed in these four files (to appease FB-internal land-blocking lints):
- `GLOSSARY.md`
- `aten/src/ATen/core/op_registration/README.md`
- `scripts/README.md`
- `torch/csrc/jit/codegen/fuser/README.md`
The rest was generated by running this command (on macOS):
```
git grep -I -l ' $' -- . ':(exclude)**/contrib/**' ':(exclude)third_party' | xargs gsed -i 's/ *$//'
```
I looked over the auto-generated changes and didn't see anything that looked problematic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53406
Test Plan:
This run (after adding the lint but before removing existing trailing spaces) failed:
- https://github.com/pytorch/pytorch/runs/2043032377
This run (on the tip of this PR) succeeded:
- https://github.com/pytorch/pytorch/runs/2043296348
Reviewed By: walterddr, seemethere
Differential Revision: D26856620
Pulled By: samestep
fbshipit-source-id: 3f0de7f7c2e4b0f1c089eac9b5085a58dd7e0d97
Summary:
Add some small utility functions to read the blob names back from the minidb
file so that we can verify how many chunks were written for each blob.
Test Plan: buck test caffe2/caffe2/python/operator_test:load_save_test
Reviewed By: mraway
Differential Revision: D26641599
fbshipit-source-id: bccb0af157d85e585e95bc7be61c4584fba3cb04
Summary:
Add a test in `load_save_test.py` that passes in a chunk_size parameter,
to ensure that we exercise the logic that passes the chunk size to the C++
serialization code.
Test Plan:
Ran the tests with the vlog level set to 3 and manually verified the log
messages showed that we were serializing in the expected chunks.
There are existing C++ tests that confirm chunking behavior works as expected
in the pure C++ code.
Reviewed By: mraway
Differential Revision: D26502578
fbshipit-source-id: cd0074f2358da81c68b0fed2c2a94818d83a957d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52388
Pull Request resolved: https://github.com/pytorch/glow/pull/5364
This allows us to change global variables through onnxifi calls, and adds python bindings along with it. Note that we supply a dummy backend_id, as it's not needed by Glow since the setting is global.
#codemod
Test Plan:
```
buck test mode/dev //glow/fb/test:test_onnxifi_optionnnpi
```
Reviewed By: jfix71, khabinov
Differential Revision: D26481652
fbshipit-source-id: 19b8201c77f653cf7d93ad68760aa7fb5ec45ff4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51768
This updates python/core.py to explicitly define all of the `DataType`
values rather than dynamically defining them at runtime from the
`caffe2_pb2` values.
This allows type checkers like Pyre and Mypy to see the members of the
`DataType` class. Otherwise the type checkers report errors such as
`"core.DataType" has no attribute "INT64"`.
This code does keep a run-time check that all of the data types defined
by `caffe2_pb2.proto` are defined correctly in this file. This way if
someone does add a new type to `caffe2_pb2.proto` it should be very
quickly apparent that this file needs to be updated and kept in sync.
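A minimal sketch of the pattern with two illustrative members; the real file spells out every enum value defined by caffe2_pb2:
```
from caffe2.proto import caffe2_pb2

class DataType:
    # Explicit literals that Pyre/Mypy can see at type-check time.
    FLOAT = 1
    INT64 = 10

# Run-time guard: fail fast if this class and the proto ever drift apart.
for _name in ("FLOAT", "INT64"):
    assert getattr(DataType, _name) == caffe2_pb2.TensorProto.DataType.Value(_name)
```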
ghstack-source-id: 121936201
Test Plan:
Confirmed that various caffe2/python tests still pass.
Verified that this allows many `pyre-fixme` comments to be removed in
downstream projects, and that Pyre is still clean for these projects.
Reviewed By: jeffdunn
Differential Revision: D26271725
Pulled By: simpkins
fbshipit-source-id: f9e95795de60aba67d7d3872d0c141ed82ba8e39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51767
The `_import_c_extension.py` finds the right C extension library to use,
and then simply re-exports all of the symbols that it defines.
This adds a `_import_c_extension.pyi` file with type hints to let type
checkers like Pyre and Mypy know the names of the symbols that will be
re-exported from the C extension.
This does not define all of the symbols provided by the C extension,
but does define all of the symbols necessary to make type checkers happy
about other code in the `caffe2/python` directory.
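A hedged illustrative excerpt of such a stub; these two entries stand in for whatever the C extension actually exports:
```
# _import_c_extension.pyi (illustrative excerpt)
from typing import List

has_gpu_support: bool

def global_init(args: List[str]) -> None: ...
```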
ghstack-source-id: 121916324
Test Plan:
Was able to have Pyre successfully type check the `caffe2/python`
directory with this stub file plus a few other changes.
Confirmed that all of the dependent projects affected by this report no new
pyre issues in sandcastle.
Ran `python test/test_type_hints.py` in the PyTorch github repository and
confirmed it also passes.
Differential Revision: D26271726
Pulled By: simpkins
fbshipit-source-id: 6dbadcf02e0b2cc44a9e3cdabe9291c1250959b4
Summary: Previously there was no regularizer implemented for fp16 sparse features. Add regularizer support here using the Float16SparseNormalize implemented in this stack.
Test Plan:
buck test //caffe2/caffe2/python:regularizer_test
In f248648705, we can see there is the operator `Float16SparseNormalize`.
Reviewed By: bigrabithong
Differential Revision: D24042567
fbshipit-source-id: 5e0065f8c10b8748daffa8a54a6bf8f461460b18
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51762
Update test_util.py to add a `make_tempdir()` function to the `TestCase`
class. The main advantage of this function is that the temporary
directory will be automatically cleaned up when the test case finishes,
so that test case does not need to worry about manually cleaning up this
directory.
This also prefixes the directory name with `caffe2_test.` so that it is
more obvious where the temporary directories came from if they are ever
left behind after a crashed or killed test process.
This updates the tests in `operator_test/load_save_test.py` to use this
new function, so they no longer have to perform their own manual cleanup
in each test.
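A minimal sketch of such a helper, assuming unittest's addCleanup machinery; the real signature may differ:
```
import shutil
import tempfile
import unittest

class TestCase(unittest.TestCase):
    def make_tempdir(self):
        path = tempfile.mkdtemp(prefix="caffe2_test.")
        # Removed automatically when the test finishes, pass or fail.
        self.addCleanup(shutil.rmtree, path, ignore_errors=True)
        return path
```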
Test Plan: python caffe2/python/operator_test/load_save_test.py
Reviewed By: mraway
Differential Revision: D26271178
Pulled By: simpkins
fbshipit-source-id: 51175eefed39d65c03484482e84923e5f39a4768
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51766
Check if we are on Windows using `sys.platform` rather than
`platform.system()`. Even though `platform.system()` is more modern, it
has a few downsides: this performs a runtime check of the platform type,
which has non-zero overhead. On Linux it actually executes the separate
`/bin/uname` process. On the other hand `sys.platform` is determined
when the Python interpreter is compiled, so this is a simple hard-coded
string.
Because it is a runtime check, `platform.system()` checks also cannot be
analyzed by static type checkers like Pyre and Mypy. These type
checkers do understand `sys.platform` checks, and can correctly avoid
complaining about code paths that use platform-specific modules and
functions. e.g., they can avoid complaining about `ctypes.WinDLL` not
existing on Linux if its use is guarded by a `sys.platform` check.
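A small sketch of the statically-checkable pattern (the module use here is illustrative):
```
import sys
import ctypes

if sys.platform == "win32":
    # Mypy/Pyre understand this guard, so WinDLL is only type-checked on
    # Windows, where it actually exists.
    load_library = ctypes.WinDLL
else:
    load_library = ctypes.CDLL
```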
ghstack-source-id: 121107705
Test Plan: Ran tests on Linux, and will check CI test results.
Reviewed By: mraway
Differential Revision: D26271724
Pulled By: simpkins
fbshipit-source-id: b86e427e4ceec0324464ba4bc88b95d5813172d0
Summary:
Increasing the deadline so as to avoid
flakiness of the test on ROCm.
Signed-off-by: Roy, Arindam <rarindam@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52013
Reviewed By: albanD
Differential Revision: D26360209
Pulled By: mrshenli
fbshipit-source-id: 1ddc7062c5ff7c980233d22844073de9fb7dcbb3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52083
This makes minor fixes in `caffe2/python` to address all errors currently
reported by Pyre.
I update the code to fix errors when doing so looked simple and safe,
and added `pyre-fixme` comments in other places.
ghstack-source-id: 121109695
Test Plan: Confirmed that Pyre no longer reports errors under `caffe2/python`
Differential Revision: D26272279
fbshipit-source-id: b1eb19d323b613f23280ce9c71e800e874ca1162
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/51769
Remove some Python 2 compatibility code that otherwise causes errors to
be reported from static type checkers.
Static type checkers complain that the old Python 2 modules and
functions referenced by this code do not exist. Given that Python 2
support is entirely deprecated now we can simply remove the
compatibility code.
ghstack-source-id: 121313191
Test Plan:
Was able to get Pyre to successfully type check the `caffe2/python`
directory with this and some other changes.
Reviewed By: Tianshu-Bao
Differential Revision: D26271723
Pulled By: simpkins
fbshipit-source-id: fec8a09466be6867388832380480aafd36616aa1
Summary: Moving caffe2_core_gpu_python contbuild to use GPU/RE
Test Plan: CI
Reviewed By: malfet
Differential Revision: D26261826
fbshipit-source-id: a6f8c7bd8368c1cb69499ea0ea7d5add0956a7ad
Summary:
The test is flaky on ROCM when deadline is set to 1 second. This is affecting builds as it is failing randomly.
Disabling for now.
Signed-off-by: Arindam Roy <rarindam@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50964
Reviewed By: houseroad
Differential Revision: D26049370
Pulled By: BIT-silence
fbshipit-source-id: 22337590a8896ad75f1281e56fbbeae897f5c3b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50393
Exponential Moving Average.
Usage:
Add ema_options in the adagrad optimizer. For details, please refer to the test workflow setting.
If ema_end == -1, the EMA will never end.
Test Plan:
buck test caffe2/caffe2/fb/optimizers:ema_op_optimizer_test
buck test caffe2/caffe2/fb/optimizers:ema_op_test
f240459719
Differential Revision: D25416056
fbshipit-source-id: a25e676a364969e3be2bc47750011c812fc3a62f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49486
Remove code for Python 3.5 and lower.
There's more that can be removed/modernised, but sticking mainly to redundant version checks here, to keep the diff/PR smaller.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46579
Reviewed By: zou3519
Differential Revision: D24453571
Pulled By: ezyang
fbshipit-source-id: c2cfcf05d6c5f65df64d89c331692c9aec09248e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49591
A bunch of these tests are marked flaky, and have been since time immemorial. (Read: as far back as Buck will build.) However closer inspection reveals that they fail if and only if run on a GPU worker. What seems to be going on is that there are more jobs than GPUs, so the contention causes waits which registers as timeouts on the test.
This diff is kind of hacky, but it basically just drops deadlines if a GPU is present. Because Caffe2 is going away I'm not too terribly concerned about a beautiful solution, but we may as well keep some test coverage if it's easy.
CC Sebastian, Ilia, Min, and Hongzheng who also have tasks for what seems to be the same flakiness.
Test Plan: Turn the tests back on and see if they fall over. (The failure repros reliably on an OnDemand GPU and is fixed by this change, so it's not really just a hail Mary.)
Reviewed By: ngimel
Differential Revision: D25632981
fbshipit-source-id: 43dcce416fea916ba91f891e9e5b59b2c11cca1a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49402
In cases of NCCLAllReduce operations there could be non-trivial overhead for
launching cooperative kernels (especially in case of async execution of
different parts of the model). This diff is reviving this operator to make it
possible to fuse multiple operations into a single kernel.
Test Plan:
Unit-test.
Used in a later diff.
Reviewed By: xianjiec
Differential Revision: D25531206
fbshipit-source-id: 64b1c161233a726f9e2868f1059316e42a8ea1fc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49322
In some cases async execution might lose dependencies (Alias-like ops) or produce suboptimal scheduling when there is a choice of which parts to schedule first. An example of the latter behavior can happen in ModelParallel training, where a copy can get lower priority compared to the rest of the execution on the given GPU, which will cause other GPUs to starve.
This operator allows to address these issues by introducing extra explicit dependencies between ops.
Test Plan:
Unit-test.
E2E testing in future diffs.
Reviewed By: xianjiec
Differential Revision: D24933471
fbshipit-source-id: 1668994c7856d73926cde022378a99e1e8db3567
Summary:
+ Add ArgMin support to Caffe2 to PyTorch converter
+ Using hypothesis to parameterize different conditions for test
Test Plan: buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test
Reviewed By: houseroad
Differential Revision: D25016203
fbshipit-source-id: 94489fcf1ed3183ec96f9796a5b4fb348fbde5bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48240
Adds the support for converting the SparseLengthsSum4BitRowwiseSparse operator from caffe2 to pytorch as a part of c2_pt_converter
Test Plan:
Added a unit test:
buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test
Tests passed:
https://our.intern.facebook.com/intern/testinfra/testrun/2251799856412296
Reviewed By: houseroad
Differential Revision: D25067833
fbshipit-source-id: 45cbc331ca35bee27e083714e65a1e87a2a2d2e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48340
This changes the context managed classes from using a decorator to define them to using inheritance. Inheritance allows the python static type checking to work correctly.
```
context.define_context()
class Bar(object): ...
context.define_context(allow_default=True)
class Foo(object): ...
```
becomes
```
class Foo(context.Managed): ...
class Bar(context.DefaultManaged): ...
```
Behavior differences:
* arg_name has been removed since it's not used anywhere
* classes need to call `super()` in `__enter__/__exit__` methods if they override (none do)
This also defines a context.pyi file to add types for Python 3. Python 2 support should not be affected.
Test Plan:
ci
buck test //caffe2/caffe2/python:context_test //caffe2/caffe2/python:checkpoint_test
Reviewed By: dongyuzheng
Differential Revision: D25133469
fbshipit-source-id: 16368bf723eeb6ce3308d6827f5ac5e955b4e29a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48407
T79817692: Fused8BitRowwiseQuantizedToFloat operator support for c2_pt_converter.
Also refactored some repeated code from the existing test functions. (Initial commit only has refactoring.)
Test Plan: buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test
Reviewed By: bugra
Differential Revision: D25069936
fbshipit-source-id: 72f6a845a1b4639b9542c6b230c8cd74b06bc5a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48404
On bento this is printing a lot of msgs like (see N408483 if you're an internal user)
```
W1123 120952.322 schema.py:811] Scalar should be considered immutable. Only call Scalar.set() on newly created Scalar with unsafe=True. This will become an error soon.
```
And it's ignoring the log level I set at global level. Removing this line unless this is super important.
Test Plan: build a local dper package and verify
Differential Revision: D25163808
fbshipit-source-id: 338d01c82b4e67269328bbeafc088987c4cbac75
Summary: is_external_input doesn't check if the lookup tables are valid. Calling .Proto() should invalidate all lookup tables and have them rebuilt on call to any methods depending on them. This adds this check to is_external_input.
Test Plan: internal unit tests
Reviewed By: dzhulgakov, esqu1
Differential Revision: D25100464
fbshipit-source-id: d792dec7e5aa9ffeafda88350e05cb757f4c4831
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023
DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
ghstack-source-id: 116901430
Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect
Reviewed By: dzhulgakov
Differential Revision: D24605460
fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47768
This stores the next ID for a given NextName(prefix, output_id) so repeated calls to NextName are significantly faster. This accounts for ~65% of time spent for large models.
Test Plan:
buck test //caffe2/caffe2/python/...
will launch canary job before landing to ensure no regressions + confirm speedup
Reviewed By: dzhulgakov
Differential Revision: D24876961
fbshipit-source-id: 668d73060d800513bc72d7cd405a47d15c4acc34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48021
Extending the operator schema check for simple memonger to dag memonger as well. As part of this, a fix is made to handle in-place ops (those having at least one output name the same as an input blob). Earlier, all output blobs from ops were treated as shareable, but that failed the assertion that external input blobs with the same name are not allowed to be shared.
Test Plan: Added corresponding unit tests
Reviewed By: hlu1
Differential Revision: D24968862
fbshipit-source-id: b6679a388a82b0d68f65ade64b85560354aaa3ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47718
Distributed Inference splits a predict net into multiple parts, part0 being the main part which contains ops to make remote calls to other parts. part0 predict net may contain AsyncIf ops to optimize rpc call usage. AsyncIf ops have internal nets which may refer to memongered blobs. This change handles AsyncIf ops to update internal nets to refer to memongered blobs.
As part of this change, I am also updating the dag memonger traversal to always start from root ops, i.e. ops with 0 in-degree. The earlier logic started traversing ops based on input head blobs, and if one of the head inputs was used in a non-root op that got visited before its parent, the traversal would throw an assertion error here: https://fburl.com/diffusion/ob110s9z . For almost all distributed inference part0 nets, it was throwing this assertion error.
Test Plan: Added corresponding tests in memonger_test.py . Could not find unit tests in c++ version of memonger.
Reviewed By: hlu1
Differential Revision: D24872010
fbshipit-source-id: 1dc99b2fb52b2bc692fa4fc0aff6b7e4c5e4f5b0
Summary: Added the MatMul operator for caffe2
Test Plan: buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test
Reviewed By: bugra
Differential Revision: D24920937
fbshipit-source-id: 7ba09ba0439cb9bd15d6a41fd8ff1a86d8d11437
Summary: To support min/max/mean/std, SummarizeOp needs to skip size checking (similar to the LpNorm error mentioned above) and accept multiple types.
Test Plan:
unit test:
`buck test //caffe2/caffe2/fb/tensorboard/tests:tensorboard_accumulate_histogram_op_test`
https://our.intern.facebook.com/intern/testinfra/testrun/1407375057859572
`buck test //caffe2/caffe2/fb/tensorboard/tests:tensorboard_accumulate_histogram_op_test --stress-runs 1000`
https://our.intern.facebook.com/intern/testinfra/testrun/2533274832166362
Reviewed By: cryptopic
Differential Revision: D24605507
fbshipit-source-id: fa08372d7c9970083c38abd432d4c86e84fb10e0
Summary:
Distributed Inference splits a predict net into multiple parts, part0 being the main part which contains ops to make remote calls to other parts. part0 predict net may contain AsyncIf ops to optimize rpc call usage. AsyncIf ops have internal nets which may refer to memongered blobs. This change handles AsyncIf ops to update internal nets to refer to memongered blobs. Here is one reference part0 predict net with AsyncIf ops: https://www.internalfb.com/intern/paste/P145812115/
As part of this change, I am also updating the dag memonger traversal to always start from root ops, i.e. ops with 0 in-degree. The earlier logic started traversing ops based on input head blobs, and if one of the head inputs was used in a non-root op that got visited before its parent, the traversal would throw an assertion error here: https://fburl.com/diffusion/ob110s9z . For almost all distributed inference part0 nets, it was throwing this assertion error.
Reviewed By: hlu1
Differential Revision: D24346771
fbshipit-source-id: ad2dd2e63f3e822ad172682f6d63f8474492255d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47541
The profiler has guided us to `schema.py`. Since these `Field`s are used everywhere and in huge quantities, we can easily make some optimizations system wide by adding `__slots__`.
From StackOverflow, benefits include:
* faster attribute access.
* space savings in memory.
Read more: https://stackoverflow.com/a/28059785/
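A minimal sketch of the optimization, with illustrative attribute names rather than schema.Field's real ones:
```
class Field(object):
    # No per-instance __dict__: saves memory and speeds up attribute
    # access for the huge number of Field objects created.
    __slots__ = ("_parent", "_field_offsets")

    def __init__(self):
        self._parent = None
        self._field_offsets = [0]
```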
Reviewed By: dzhulgakov
Differential Revision: D24771078
fbshipit-source-id: 13f6064d367440069767131a433c820eabfe931b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47542
The previous way of doing `Field.__init__(self, [])` is just wrong. Switching to Python2 compatible way: `super(ObjectName, self).__init__(...)`
Reviewed By: dzhulgakov
Differential Revision: D24771077
fbshipit-source-id: d6798c72090c0264b6c583602cae441a1b14587c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47530
`Net.AddExternalInput` should raise if there are duplicate names. The previous code would only raise if the addition of duplicates was in separate calls, but not if it was in the same call.
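A minimal sketch of the new behavior (AddExternalInput's variadic call shape is assumed from the tests below):
```
from caffe2.python import core

net = core.Net("example")
net.AddExternalInput("x")

# Duplicate across separate calls: raised before this change too.
#   net.AddExternalInput("x")
# Duplicate within a single call: now also raises.
#   net.AddExternalInput("y", "y")
```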
Test Plan:
Added two new regression tests
```
✓ Pass: caffe2/caffe2/python:core_test - testSetInputRecordWithBlobs (caffe2.caffe2.python.core_test.TestExternalInputs) (9.622)
✓ Pass: caffe2/caffe2/python:core_test - testAddExternalInputShouldRaiseIfDuplicate (caffe2.caffe2.python.core_test.TestExternalInputs) (9.639)
✓ Pass: caffe2/caffe2/python:core_test - testSetInputRecordWithoutBlobs (caffe2.caffe2.python.core_test.TestExternalInputs) (9.883)
✓ Pass: caffe2/caffe2/python:core_test - testAddExternalInputShouldRaiseIfDuplicateInSameCall (caffe2.caffe2.python.core_test.TestExternalInputs) (10.153)
```
Test trained 2 models. No issues
f230755456
f230754926
Reviewed By: dzhulgakov
Differential Revision: D24763586
fbshipit-source-id: c87088441d76f7198f8b07508b2607aec13521ed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47512
I deleted the last line of `__init__` -- `self._field_offsets.append(offset)` -- and the unittests didn't fail.
So this diff is to improve test coverage.
Test Plan:
```
✓ Pass: caffe2/caffe2/python:schema_test - testInitShouldSetEmptyParent (caffe2.caffe2.python.schema_test.TestField) (8.225)
✓ Pass: caffe2/caffe2/python:schema_test - testInitShouldSetFieldOffsetsIfNoChildren (caffe2.caffe2.python.schema_test.TestField) (8.339)
✓ Pass: caffe2/caffe2/python:schema_test - testInitShouldSetFieldOffsets (caffe2.caffe2.python.schema_test.TestField) (8.381)
```
Reviewed By: dzhulgakov
Differential Revision: D24767188
fbshipit-source-id: b6ce8cc96ecc61768b55360e0238f7317a2f18ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47475
This improves the core.Net cloning/init performance by quite a bit. It makes set_input_record run in linear time instead of quadratic, by checking the external_input map instead of regenerating the external inputs each time and then iterating over them.
Test Plan: unit tests + canary runs
Reviewed By: dzhulgakov
Differential Revision: D24765346
fbshipit-source-id: 92d9f6dec158512bd50513b78675174686f0f411
Summary:
Add `last_n_window_collector` as C2 supports and PyTorch currently does not have this operator: https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/caffe2/operators/last_n_window_collector.cc?lines=139
## Problem that we are solving
This operator works on multiple pieces of data and collects last `n` element that has been seen.
If you have the following pieces of data that have been passed around:
```
[1, 2, 3, 4]
[5, 6, 7]
[8, 9, 10, 11]
```
over 3 passes, and the collector size is given as 6. The expected result is:
```
[6, 7, 8, 9, 10, 11]
```
What this means is that we essentially need a FIFO (First In, First Out) mechanism: as we pass data through the collector, newer data pushes older data out.
In this particular example, in the first pass (the data is `[1, 2, 3, 4]`), we hold `[1, 2, 3, 4]` in the queue, as our queue size is 6.
In the second pass (the data is `[5, 6, 7]`), we hold `[2, 3, 4, 5, 6, 7]` in the queue; since 1 was inserted first, it is dropped due to the size limitation of the queue.
In the third pass (the data is `[8, 9, 10, 11]`), we hold `[6, 7, 8, 9, 10, 11]` in the queue, and `2, 3, 4, 5` are dropped due to the size of the queue.
For multidimension case, when we have the following data:
```
[[1, 2], [2, 3], [3, 4], [4, 5]]
[[5, 6], [6, 7], [7, 8]]
[[8, 9], [9, 10], [10, 11], [11, 12]]
```
and our queue size is 6.
In the first pass, we will have ` [[1, 2], [2, 3], [3, 4], [4, 5]]`
In the second pass, we will have `[[2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8]]`
In the third pass, we will have `[[6, 7], [7, 8], [8, 9], [9, 10], [10, 11], [11, 12]]`
### The implementation
I am using the FIFO queue (`deque`) from Python's collections library. This accepts a `maxlen` argument which can be used to set the size of the queue.
I take the last n indices of the tensor through list indexing, and in this operator I am not doing a copy.
In the test plan, I have both single dimension tensors as well as multi-dimension tensors.
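A minimal sketch of that FIFO logic on the scalar example above; the real operator works on tensors and avoids copies:
```
from collections import deque

q = deque(maxlen=6)  # keeps only the last 6 elements seen
for batch in ([1, 2, 3, 4], [5, 6, 7], [8, 9, 10, 11]):
    q.extend(batch)  # oldest elements fall off once maxlen is exceeded

print(list(q))  # [6, 7, 8, 9, 10, 11]
```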
### Benchmark
I used various different configurations and added a benchmark test. The PyTorch implementation is much faster than the Caffe2 implementation:
#### CPU Benchmark
```
torch_response.median
0.00019254473969340324
caffe_response.median
0.00030233583599794657
```
#### GPU Benchmark
```
torch_response.mean
0.000081007429903838786
caffe_response.mean
0.00010279081099724863
```
Test Plan:
### For CPU:
```
buck test //caffe2/torch/fb/sparsenn:test
```
### For GPU:
- Used an on-demand machine and did the following commands:
```
jf get D24435544
buck test mode/opt //caffe2/torch/fb/sparsenn:test
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/4222124688138052/
Reviewed By: dzhulgakov, radkris-git
Differential Revision: D24435544
fbshipit-source-id: 8193b4746b20f2a4920fd4d41271341045cdcee1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46590
This operator is very similar to LengthsToRanges but doesn't pack the offsets next to the original lengths.
Reviewed By: yf225
Differential Revision: D24419746
fbshipit-source-id: aa8b014588bb22eced324853c545f8684086c4e4
Summary: I was reading/looking into how LocalSession works and realized that the workspace type being passed around was the bound function on TaskGroup instead of the actual type. This meant that all workspaces for localsession would always be global, because they'd never match the private workspace type.
Test Plan: <not sure, could use some suggestions>
Reviewed By: cryptopic
Differential Revision: D24458428
fbshipit-source-id: 0f87874babe9c1ddff25b5363b443f9ca37e03c1
Summary:
We've been seeing a lot of Hypothesis timeouts and from profiling a few of the failing tests one of the contributing factors is really slow grad checker. In short, it launches the whole op for each of the input elements so the overall complexity is O(numel^2) at least.
This applies a very unscientific hack to just run grad check on the first and last few elements. It's not ideal, but it's better than flaky tests. One can still explicitly opt in with the env var.
Reviewed By: malfet
Differential Revision: D23336220
fbshipit-source-id: f04d8d43c6aa1590c2f3e72fc7ccc6aa674e49d2
Summary: Similar to If operator, AsyncIf also contains nets in args. It needs the same handling.
Test Plan:
New unit test test_control_op_remap
`buck test caffe2/caffe2/python:core_test`
Also it worked end to end in prototype of dist bulk eval workflow f226680903
Reviewed By: yyetim
Differential Revision: D24451775
fbshipit-source-id: 50594e2ab9bb457329ed8da7b035f7409461b5f6
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal
Makes them more readable and possibly faster. Care has to be taken because `map` applies the function immediately in Python 2 (Python 3's `map` is lazy), while `(x for x in xs)` is a generator expression which gets evaluated later. This is a benefit in some cases where it is not required to actually create the list of values in memory (e.g. when passing to `tuple`, `extend`, or `join`).
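A tiny illustration of the eager/lazy distinction:
```
xs = [1, 2, 3]

squares_list = [x * x for x in xs]  # list comprehension: built right away
squares_gen = (x * x for x in xs)   # generator expression: built on demand

print(sum(squares_gen))  # consumed here -> 14, with no intermediate list
```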
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462
Reviewed By: zou3519
Differential Revision: D24422343
Pulled By: ezyang
fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46457
Wanted to see if using a CopyMatrix specialized for float that uses mkl_somatcopy could be faster, but it wasn't. Still, I want to check in the benchmark so it can be used later.
Test Plan: .
Reviewed By: dskhudia
Differential Revision: D24345901
fbshipit-source-id: d3e68dbb560e3138fda11c55789cd41bc0715c6d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45551
The FP16 version of the SparseNormalize op in Caffe2 is missing. This diff adds FP16 support to unblock the MC process of adding FP16 to Dper3.
Check https://fb.quip.com/L0T2AXGwUY3n#EReACAeifk3 .
One question is whether the pure FP16 SparseNormalize op will affect accuracy; maybe we should do it in the FP32 domain.
ghstack-source-id: 114184398
Test Plan:
```
buck run mode/opt //caffe2/caffe2/python/operator_test:sparse_normalize_test
```
```
buck run mode/opt -c python.package_style=inplace mode/no-gpu //caffe2/caffe2/python/benchmarks:sparse_normalize_benchmark -- --fp16
```
Reviewed By: jspark1105
Differential Revision: D24005618
fbshipit-source-id: 8b918ec4063fdaafa444779b95206ba2b7b38537