pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Bugra Akyildiz	27c7158166	Remove __future__ imports for legacy Python2 supports (#45033 ) Summary: There is a module called `2to3` which you can target for future specifically to remove these, the directory of `caffe2` has the most redundant imports: ```2to3 -f future -w caffe2``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/45033 Reviewed By: seemethere Differential Revision: D23808648 Pulled By: bugra fbshipit-source-id: 38971900f0fe43ab44a9168e57f2307580d36a38	2020-09-23 17:57:02 -07:00
Daya Khudia	09aee06e82	[caffe2] Replace embedding conversion ops with fbgemm functions (#44843 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44843 Replace perfkernels calls with fbgemm kernels to avoid code duplication ghstack-source-id: 112496292 Test Plan: CI Reviewed By: radkris-git Differential Revision: D23675519 fbshipit-source-id: 05c285a9eeb9ea109a04a78cb442a24ee40a4aec	2020-09-22 11:57:01 -07:00
Stanislau Hlebik	b774ce54f8	remediation of S205607 fbshipit-source-id: 798decc90db4f13770e97cdce3c0df7d5421b2a3	2020-07-17 17:19:47 -07:00
Stanislau Hlebik	8fdea489af	remediation of S205607 fbshipit-source-id: 5113fe0c527595e4227ff827253b7414abbdf7ac	2020-07-17 17:17:03 -07:00
Jongsoo Park	7a837019a4	[caffe2] optimize 2/4-bit row-wise quantization (#387 ) Summary: Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/387 Pull Request resolved: https://github.com/pytorch/pytorch/pull/39985 avx2 optimized 2/4-bit row-wise quantization/dequantization in perfkernels. This diff slightly changes the numerics of quantization by multiplying by the inverse of scale instead of dividing by scale. Test Plan: In my devserver for i in 2 4 8; do echo $i; buck run mode/opt :fused_rowwise_nbit_conversion_bench -- --bit-rate=$i; done Before this diff 2-bit 3.35394 ms. 100%. FloatToFused2BitRowwiseQuantized 4-bit 3.60351 ms. 100%. FloatToFused4BitRowwiseQuantized 8-bit 0.434467 ms. 100%. FloatToFused8BitRowwiseQuantized After this diff 2-bit 0.606386 ms. 100%. FloatToFused2BitRowwiseQuantized 4-bit 0.446683 ms. 100%. FloatToFused4BitRowwiseQuantized 8-bit 0.4349 ms. 100%. FloatToFused8BitRowwiseQuantized Reviewed By: choudharydhruv, jianyuh Differential Revision: D22033195 fbshipit-source-id: d3a219e47b8345268d90a160c9314ed0d5b71467	2020-06-19 21:28:31 -07:00
Taiqing Wang	8cb1f2f9dc	implement L2 regularization for Adagrad in caffe2 and dper (#37705 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/37705 Pull Request resolved: https://github.com/pytorch/pytorch/pull/37372 Posted note: [Regularizing SparseNN Against Over-fitting](https://fb.workplace.com/notes/taiqing-wang/regularizing-sparsenn-against-over-fitting/220306075902708/) Problem formulation L(w) = J(w) + lambda/2 * ||w||^2 J(w) is the empirical loss, and ||w||^2 is the squared L2 norm of the parameters, a.k.a. L2 regularizer. dL(w)/ dw_i = dJ(w)/dw_i + lambda w_i dL(w)/ dw_i is the gradient of L(w) w.r.t. w_i. To implement the L2 regularizer, lambda * w_i is added to the gradient of J(w) w.r.t. w_i. lambda is called weight decay in this implementation. Code changes * In the initialization method of AdagradOptimizer, a new input argument, weight_decay, is added. * In the _run function of AdagradOptimizer, the weight decay will be skipped for 1d bias vectors. * In the parameter update functions of Adagrad, the gradient is updated by weight_decay * w_i. The default value for weight_decay is zero. Test Plan: ` buck build caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test_weight_decay ` ` ./buck-out/gen/caffe2/caffe2/fb/dper/layer_models/tests/split_1/sparse_nn_test_weight_decay#binary.par ` Reviewed By: jspark1105 Differential Revision: D21258652 fbshipit-source-id: d2366ddcd736a03205a2d16f914703b16d9fce8f	2020-05-03 10:42:49 -07:00
Dmytro Dzhulgakov	7576cf8d00	[caffe2] Use cpuinfo in perfkernels to simplify build dependency (#36371 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36371 This allows dropping the circular dependency and removing unknown_symbols in the Buck build. It'd be good to get rid of GetCpuId altogether in favor of cpuinfo, but it's not really blocking anything. Reviewed By: malfet Differential Revision: D20958000 fbshipit-source-id: ed17a2a90a51dc1adf9e634af56c85f0689f8f29	2020-04-10 13:26:34 -07:00
Evgeny Fiksman	e372f42110	[caffe2] Explicit vectorization of LSTM operator (#35556 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35556 Pull Request resolved: https://github.com/pytorch/pytorch/pull/35542 Apply explicit vectorization to lstm_unit operator. Enabled by -DENABLE_VECTORIZATION=1. This optimization requires vector library support and was tested with Intel SVML & clang. However, compilers that support OpenMP 4.5 with the omp simd extension should also benefit. After the code changes In file included from caffe2/caffe2/operators/lstm_unit_op.cc:1: caffe2/caffe2/operators/lstm_unit_op.h:60:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize] VECTOR_LOOP for (int d = 0; d < D; ++d) { caffe2/caffe2/operators/lstm_unit_op.h:60:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize] caffe2/caffe2/operators/lstm_unit_op.h:112:1: remark: vectorized loop (vectorization width: 8, interleaved count: 1) [-Rpass=loop-vectorize] VECTOR_LOOP for (int d = 0; d < D; ++d) { Test Plan: Check failures at OSS CI - No build failures related to this change - Failing tests are: - py3.6-clang7-rocmdeb-ubuntu16.04-test2 >RuntimeError: fft: ATen not compiled with MKL support - caffe2_onnx_ort2_py3_6_clang7_ubuntu16_04_test - >gradient_check_test.py::TestMakeTwo Exited with code exit status 1 - pytorch_macos_10_13_py3_test , Test errors like: > ERROR [0.014s]: test_boolean_indexing_weirdness_cpu (__main__.NumpyTestsCPU) RuntimeError: shape mismatch: indexing tensors could not be broadcast together with shapes [0], [2] - caffe2_onnx_ort1_py3_6_clang7_ubuntu16_04_test - No failure info Reviewed By: jspark1105 Differential Revision: D20484640 fbshipit-source-id: 8fb82dbd6698c8de3e0bbbc0b48d15b70e36ca94	2020-04-01 17:19:56 -07:00
peter	e3daf70184	Fix AVX detection with clang-cl (#35653 ) Summary: Defining macros `/D__F16C__` or something similar won't work on clang-cl. Pull Request resolved: https://github.com/pytorch/pytorch/pull/35653 Differential Revision: D20735878 Pulled By: ezyang fbshipit-source-id: 392a664b0a9e74222b1a03b8c3f6ebb2c61d867e	2020-03-30 07:53:37 -07:00
peter	45c9ed825a	Formatting cmake (to lowercase without space for if/elseif/else/endif) (#35521 ) Summary: Running commands: ```bash shopt -s globstar sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i CMakeLists.txt sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i caffe2//CMakeLists.txt sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i torch//CMakeLists.txt sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i c10//CMakeLists.txt sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake//.cmake sed -e 's/IF (/if(/g' -e 's/IF(/if(/g' -e 's/if (/if(/g' -e 's/ELSE (/else(/g' -e 's/ELSE(/else(/g' -e 's/else (/else(/g' -e 's/ENDif(/endif(/g' -e 's/ELSEif(/elseif(/g' -i cmake//.cmake.in ``` We may further convert all the commands into lowercase according to the following issue: `77543bde41`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/35521 Differential Revision: D20704382 Pulled By: malfet fbshipit-source-id: 42186b9b1660c34428ab7ceb8d3f7a0ced5d2e80	2020-03-27 14:25:17 -07:00
Jongsoo Park	a7fe200f5f	[caffe2] simplify caffe2 code with fbgemm handling block size 1 emb (#33774 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/33774 Simplify caffe2 code using D19246900 Test Plan: CI Reviewed By: jianyuh Differential Revision: D20102410 fbshipit-source-id: 8de4d9cfac66898db0718ac6477339fd5e5428e3	2020-02-27 14:45:28 -08:00
Jongsoo Park	c57f8984e6	[caffe2] make order btw div and mul in adgrad consistent (#32974 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32974 Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/286 Re-attempt of D18805426 . Decided to be consistent with PyTorch Adagrad. There was an inconsistency in the order of operations between scalar and SIMD code when we compute Adagrad. This diff makes them consistent by doing w += lr * grad / (sqrt(moment) + epsilon) in Adagrad and w += lr / (sqrt(moment) + epsilon) * grad in RowWiseSparseAdagrad. The Adagrad order is consistent with PyTorch (see aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp addcmul_cpu_kernel function). The RowWiseSparseAdagrad order is to make compute more efficient. In RowWiseSparseAdagrad, lr / (sqrt(moment) + epsilon) is shared among all elements in the row. Also, we're not going to use FMA, to be consistent with PyTorch (even though it provides a little accuracy benefit). Test Plan: CI Reviewed By: wx1988 Differential Revision: D19342865 fbshipit-source-id: e950c16f2e1c4a2f2a3ef53b1705db373c67f341	2020-02-16 22:45:59 -08:00
Jianyu Huang	a840afbeb4	[pytorch][embeddingbag_8bit] Add include_last_offset option to Fused 8bit EmbeddingBag and parallelize the op (#32683 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/32683 Pull Request resolved: https://github.com/pytorch/glow/pull/4079 Similar to D17768404, we changed the EmbeddingBag operator for 8-bit fused version to add the option to include the last offset and parallelize the op. ghstack-source-id: 97404645 Test Plan: To generate the AVX2 code (`embedding_lookup_fused_8bit_rowwise_idx_avx2.cc`): ``` python hp_emblookup_codegen.py --fused --use-offsets ``` To test the correctness: ``` buck test //caffe2/torch/fb/sparsenn:test -- test_embedding_bag_byte_rowwise_offsets --print-passing-details ``` Reviewed By: yinghai Differential Revision: D19592761 fbshipit-source-id: f009d675ea3f2228f62e9f86b7ccb94700a0dfe0	2020-01-29 16:04:56 -08:00
Jianyu Huang	3ada2e0d64	[pytorch][embeddingbag] Parallelize the EmbeddingBag operator (#4049 ) Summary: Pull Request resolved: https://github.com/pytorch/glow/pull/4049 Pull Request resolved: https://github.com/pytorch/pytorch/pull/27477 We would like to add the intra-op parallelization support for the EmbeddingBag operator. This should bring speedup for the DLRM benchmark: https://github.com/pytorch/pytorch/pull/24385 Benchmark code: ``` from __future__ import absolute_import, division, print_function, unicode_literals import torch import time eb = torch.nn.EmbeddingBag(1000000, 64, mode='sum') input = torch.LongTensor(1500).random_(0, 1000000) offsets = torch.zeros(64, dtype=torch.int64) niter = 10000 s = time.time() for _ in range(niter): out = eb(input, offsets) time_per_iter = (time.time() - s) / niter print('time_per_iter', time_per_iter) print('GB/s', (input.numel() * 64 * 4 + out.numel() * 4) / time_per_iter / 1e9) ``` The following results are single core on Skylake T6: - Before our change (with the original caffe2::EmbeddingLookup) time_per_iter 6.313693523406982e-05 GB/s 6.341517821789133 - After our change using the EmbeddingLookupIdx API which takes the offsets instead of lengths. time_per_iter 5.7627105712890626e-05 GB/s 6.947841559053659 - With Intel's PR: https://github.com/pytorch/pytorch/pull/24385 time_per_iter 7.393271923065185e-05 GB/s 5.415518381664018 For multi-core performance, because Clang doesn't work with OMP, I can only see the single-core performance on SKL T6. ghstack-source-id: 97124557 Test Plan: With D16990830: ``` buck run mode/dev //caffe2/caffe2/perfkernels:embedding_bench ``` With D17750961: ``` buck run mode/opt //experimental/jianyuhuang/embeddingbag:eb buck run mode/opt-lto //experimental/jianyuhuang/embeddingbag:eb ``` OSS test ``` python run_test.py -i nn -- TestNNDeviceTypeCPU.test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu ``` Buck test ``` buck test mode/dev-nosan //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets_cpu" OMP_NUM_THREADS=3 buck test mode/opt -c pytorch.parallel_backend=tbb //caffe2/test:nn -- "test_EmbeddingBag_per_sample_weights_and_new_offsets" --print-passing-details ``` Generate the AVX2 code for embedding_lookup_idx_avx2.cc: ``` python hp_emblookup_codegen.py --use-offsets ``` Differential Revision: D17768404 fbshipit-source-id: 8dcd15a62d75b737fa97e0eff17f347052675700	2020-01-23 21:29:44 -08:00
Brian Wignall	f326045b37	Fix typos, via a Levenshtein-type corrector (#31523 ) Summary: Should be non-semantic. Uses https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings/For_machines to find likely typos, with https://github.com/bwignall/typochecker to help automate the checking. Uses an updated version of the tool used in https://github.com/pytorch/pytorch/pull/30606 . Pull Request resolved: https://github.com/pytorch/pytorch/pull/31523 Differential Revision: D19216749 Pulled By: mrshenli fbshipit-source-id: 7fd489cb9a77cd7e4950c1046f925d57524960ea	2020-01-17 16:03:19 -08:00
Hector Yuen	9e9ca6ec37	add conversion functions to embedding tables (#31083 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31083 add (fp32/fp16)<->(int8 rowwise quantized fp32/fp16 scale biases) Test Plan: added unit tests enhanced shape inference tests Reviewed By: jspark1105 Differential Revision: D18920547 fbshipit-source-id: 6b3d7cb93f9d1669ecf511817d73976177632891	2020-01-08 16:56:12 -08:00
Jongsoo Park	7a12ccd003	optimize FloatToFused8BitRowwiseQuantized and Fused8BitRowwiseQuantizedToFloat (#31470 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/31470 Optimize performance of these two operators. Additionally use nearbyint instead of round to be consistent with 4-bit embedding table quantization. Reviewed By: hyuen Differential Revision: D19072103 fbshipit-source-id: efe96f14aeff7958cceb453ed625d3fd693891ff	2019-12-20 10:09:26 -08:00
Jongsoo Park	e09c415387	Back out "make the order btw div and mul in adagrad update consistent" (#30737 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30737 Original commit changeset: 2a8b2a3f5401 Reverting this to be safe until we address test failures in T58528495 Test Plan: CI Reviewed By: wx1988 Differential Revision: D18812384 fbshipit-source-id: 2a3ac554024773022ec827f259127e4c8cffe6e2	2019-12-04 17:43:45 -08:00
Jongsoo Park	d32f261f16	make the order btw div and mul in adagrad update consistent (#30449 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/30449 There was an inconsistency in the order of operation between scalar and SIMD code when we compute Adagrad. In this diff we first compute effective_lr = lr / (sqrt(moment) + epsilon) and then multiply with gradient. Test Plan: CI Reviewed By: protonu Differential Revision: D18703416 fbshipit-source-id: 2a8b2a3f5401466549561412bd22f07abac3c598	2019-12-02 13:53:38 -08:00
Jongsoo Park	649e7f057e	fix comment index_size->output_size (#29831 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/29831 As title. Thanks Aleks Zi for finding this! Test Plan: Just changing comments Reviewed By: zlateski Differential Revision: D18511259 fbshipit-source-id: 5f1ad9ba53db9b22622a556ec214ced361ec016a	2019-11-16 01:49:02 -08:00
James Donald	7f485121a6	Avoid MSVC _cvtsh_ss() workaround with clang-cl (#29726 ) Summary: We (me fnabulsi bmcdb) have a handful of fixes used locally to build and run with clang-cl. I am aware of https://github.com/pytorch/pytorch/issues/8784 but it has not been touched in almost a year. It may be more practical to upstream the non-controversial fixes piecewise. For example, this one. Here, the dummy version of `_cvtsh_ss` for MSVC is not required (and hence causes conflicts) when using clang-cl so can be #ifdef'd out. Pull Request resolved: https://github.com/pytorch/pytorch/pull/29726 Differential Revision: D18478120 Pulled By: ezyang fbshipit-source-id: cdcd94251e68347446f2ad1ac5a0e71089f7d0ab	2019-11-13 12:49:13 -08:00
Yinghai Lu	cddc147267	Back out "Revert D17826873: Adding support to offsets based Fused8BitRowwiseEmbeddingLookup" (#27728 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27728 Original commit changeset: 15ad64e49f92 Test Plan: same as previous one. Reviewed By: dreamingleo Differential Revision: D17872553 fbshipit-source-id: fd9d180d5e02e2c17285898c79cdd9509ffb8bbf	2019-10-10 23:52:43 -07:00
Ailing Zhang	b3cb072de7	Revert D17826873: Adding support to offsets based Fused8BitRowwiseEmbeddingLookup Test Plan: revert-hammer Differential Revision: D17826873 Original commit changeset: 23c4a96d9252 fbshipit-source-id: 15ad64e49f922a859abc574b261ac0f857682ff4	2019-10-10 16:16:06 -07:00
Yinghai Lu	ce6287f675	Adding support to offsets based Fused8BitRowwiseEmbeddingLookup (#27635 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/27635 PyTorch uses `offsets` instead of `lengths` for embedding table lookup. Adding support to that for fused quantized version. AVX2 version is generated with ``` python caffe2/caffe2/perfkernels/hp_emblookup_codegen.py --fused --use-offsets ``` Test Plan: ``` buck test caffe2/torch/fb/sparsenn:test ``` Reviewed By: jianyuh Differential Revision: D17826873 fbshipit-source-id: 23c4a96d92521deaebc02b688ad735d76a4476df	2019-10-10 10:50:44 -07:00
Jiakai Liu	67c530851c	get rid of protobuf dependencies (#25650 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25650 This PR removes protobuf dependencies from mobile build altogether: - caffe2/proto: protobuf files, including caffe2.proto and torch.proto; - caffe2 components that depend on caffe2.proto, including most part of caffe2/core, caffe2/utils; - libprotobuf / libprotobuf-lite dependencies; - protobuf compiler; - some utils class, e.g.: netdef_converter.cpp; - introduce a macro to disable third_party/onnx which depends on protobuf; Test Plan: - builds; - link with demo app to make sure it can load and run a model in pickle format; Differential Revision: D17183548 Pulled By: ljk53 fbshipit-source-id: fe60b48674f29c4a9b58fd1cf8ece44191491531	2019-09-06 08:48:20 -07:00
Jiakai Liu	2bed201190	remove caffe2.pb.h dependency for embedding_lookup_idx.cc (#25670 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/25670 This is part of the effort to get rid of protobuf dependency for libtorch mobile build. embedding_lookup_idx.cc is used by ATen/EmbeddingBag.cpp. It indirectly includes caffe2.pb.h but doesn't really need it. Clean up the headers to unblock no-protobuf mobile build. The broader problem is that many common headers in pytorch/caffe2 directly or indirectly include caffe2.pb.h. After landing the stack of changes to remove protobuf from OSS libtorch mobile build, it's going to constrain how ATen and other parts of pytorch use caffe2 components: it will break OSS mobile CI if a PR introduces a dependency on a caffe2 file that indirectly includes caffe2.pb.h. We will need to tease out caffe2.pb.h dependencies like in this diff, or do a refactor to replace protobuf generated types. Chatted with gchanan and ezyang to confirm that there is no plan to add more dependencies to caffe2 components from ATen in near future, so this should be fine. Test Plan: - build locally with stacked diffs Differential Revision: D17191913 Pulled By: ljk53 fbshipit-source-id: 1248fe6424060a8bedcf20e73942b7500ae5e815	2019-09-06 00:54:36 -07:00
Jianyu Huang	ad7250d315	Make EmbeddingLookup APIs take offsets rather than lengths to match the PyTorch's EmbeddingBag (#24944 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/24944 As Title says, we would like to make the EmbeddingLookup APIs take offsets rather than lengths to match the PyTorch's EmbeddingBag. ghstack-source-id: 88883902 Test Plan: python hp_emblookup_codegen.py --use-offsets Check the benchmark in D16990830. Reviewed By: jspark1105 Differential Revision: D16924271 fbshipit-source-id: 7fac640c8587db59fd2304bb8e8d63c413f27cb8	2019-08-23 14:43:56 -07:00
Jongsoo Park	431d6e2189	minor comment fix (#22140 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/22140 As title Reviewed By: protonu Differential Revision: D15966759 fbshipit-source-id: 15dbf9de60cced29055aeaac3b71c1ff41cfe1d4	2019-08-08 21:08:47 -07:00
Hector Yuen	26db46b324	change the epilogue of SLS to match the simd section (#21439 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/21439 this bug got exposed after testing accuracy on shapes not multiples of 8 Reviewed By: jspark1105 Differential Revision: D15684759 fbshipit-source-id: 2950f2bd87ee1d8e539148285a14c755f606b3a7	2019-06-05 18:41:55 -07:00
Jongsoo Park	c185145d8c	remove dependency to caffe2::math and eigen (#21169 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/21169 We should minimize dependency from perfkernels (we were including eigen header files only in cc files not compiled with avx or avx2 options but better to be very strict because it's easy to introduce illegal instruction errors in perfkernels) Reviewed By: salexspb Differential Revision: D15563839 fbshipit-source-id: d4b1bca22d7f2e6f20f23664d4b99498e5984586	2019-05-31 11:55:16 -07:00
Jongsoo Park	101176870e	eliminate FE_INVALID exceptions related to fp16 conversion (#20390 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/20390 duc0 Ngo implemented observing floating point exceptions but there were a couple of places where we have "benign" floating point exceptions leading to false positives. This diff eliminates one source of such false positives, namely using _mm256_cvtph_ps and _mm256_cvtps_ph for partially uninitialized array for the remainder loop. Reviewed By: hx89 Differential Revision: D15307358 fbshipit-source-id: 38f57dfdd90c70bc693292d2f9c33c7ba558e2c9	2019-05-13 23:42:01 -07:00
James Reed	d17c22d024	Improve embedding_bag add kernel (#19329 ) Summary: This was actually getting pretty poor throughput with respect to memory bandwidth. I used this test to measure the memory bandwidth specifically for the AXPY call: https://gist.github.com/jamesr66a/b27ff9ecbe036eed5ec310c0a3cc53c5 And I got ~8 GB/s before this change, but ~14 GB/s after this change. This seems to speed up the operator overall by around 1.3x (benchmark: https://gist.github.com/jamesr66a/c533817c334d0be432720ef5e54a4166): == Before == time_per_iter 0.0001298875093460083 GB/s 3.082544287868467 == After == time_per_iter 0.00010104801654815674 GB/s 3.9623142905451076 The large difference between the local BW increase and the full-op BW increase likely indicates significant time is being spent elsewhere in the op, so I will investigate that. EDIT: I updated this PR to include a call into caffe2/perfkernels. This is the progression: Before time_per_iter 8.983819484710693e-05 GB/s 4.456723564864611 After no axpy time_per_iter 7.19951868057251e-05 GB/s 5.56126065872172 After perfkernels time_per_iter 5.6699180603027346e-05 GB/s 7.061548257694262 After perfkernels no grad time_per_iter 4.388842582702637e-05 GB/s 9.122769670026413 Pull Request resolved: https://github.com/pytorch/pytorch/pull/19329 Reviewed By: dzhulgakov Differential Revision: D14969630 Pulled By: jamesr66a fbshipit-source-id: 42d1015772c87bedd119e33c0aa2c8105160a738	2019-04-19 19:16:24 -07:00
Xiaomeng Yang	265fa0ce4d	Move math::Axpy function to elementwise lib (#18316 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/18316 Move math::Axpy function to elementwise lib i-am-not-moving-c2-to-c10 Reviewed By: houseroad Differential Revision: D14574697 fbshipit-source-id: 7cfbb2da295c8966c5328bd6b577cce2638eea62	2019-03-26 12:19:19 -07:00
Jongsoo Park	e21aa16931	more careful use of auto in sparse operations (#17958 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17958 In some places, we need 64-bit for corner cases even though it's going to be rare. In some places, we were using 64-bit unnecessarily. Reviewed By: hyuen Differential Revision: D14435523 fbshipit-source-id: e01ab73029ff780133af7ff4bbbe2e17926ed5a2	2019-03-14 22:10:42 -07:00
Jongsoo Park	1d522598fb	use fp16<->fp32 intrinsic (#17496 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/17496 As title. Reviewed By: hyuen Differential Revision: D14222907 fbshipit-source-id: d5d6c032e725ca8b52aca2be7401ec3c59f6a242	2019-03-07 02:23:07 -08:00
Jongsoo Park	db121375e7	more careful use of inline/template function in perfkernels (#15388 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15388 This is another pass to make perfkernels code safer from illegal instruction errors. Removed the dependency on c10/util/Logging.h. We err on the safer side at the expense of some verbosity. Reviewed By: dskhudia Differential Revision: D13502902 fbshipit-source-id: 4f833115df885c5b4f8c1ca83b9badea1553f944	2019-01-30 22:49:37 -08:00
Edward Yang	b9b160d86f	Remove ATen/Half.h and ATen/core/Half.h forwarding headers. Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16115 Reviewed By: bddppq Differential Revision: D13717049 fbshipit-source-id: fb1d690183a932a1fa1a2d235f3219520f51620a	2019-01-18 10:55:21 -08:00
Tongliang Liao	55511004d1	Resolve errors in perfkernel for Windows (#16031 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/16031 1. MSVC only has _mm_prefetch(const char*, int). Fixed in both python codegen and C++ files. 2. uint32_t in "cvtsh_ss_bugfix.h" requires "#include <cstdint>". 3. Some files use gflags headers. Add dependency via c10. 4. Isolate arch flags with interface library and private compile options. Pull Request resolved: https://github.com/pytorch/pytorch/pull/15753 Reviewed By: dskhudia Differential Revision: D13636233 Pulled By: jspark1105 fbshipit-source-id: cdcbd4240e07b749554a2a5676c11af88f23c31d	2019-01-16 21:51:00 -08:00
Mickaël Schoentgen	71c6e24373	Fix several ResourceWarning: unclosed file (#15746 ) Summary: Hello, This is a patch to fix `ResourceWarning: unclosed file`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/15746 Differential Revision: D13587286 Pulled By: soumith fbshipit-source-id: 08ac34c5b51d9334867f65a2927bff11511553f3	2019-01-09 15:36:53 -08:00
Jongsoo Park	e012b183dd	handle empty inputs to SparseLengthsMean correctly (#15389 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/15389 SparseLengthsMean was generating uninitialized data for empty inputs (lengths == 0). We should return zeros. The unit tests were also not covering this special case which is fixed by this diff. Reviewed By: salexspb Differential Revision: D13515970 fbshipit-source-id: 3c35265638f64f13f0262cee930c94f8628005da	2018-12-21 22:20:14 -08:00
Jongsoo Park	1e0eab5df8	minimize header file includes from _avx2.cc (#14950 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14950 Minimize the number of headers included from _avx2.cc files to avoid accidental compilation of functions defined in the header files reused by other translation units, which can lead to illegal instruction errors. Reviewed By: dskhudia Differential Revision: D13394483 fbshipit-source-id: 67149a6fb51f7f047e745bfe395cb6dd4ae7c1ae	2018-12-13 00:18:11 -08:00
Zachary DeVito	92314c83fa	re-enable copy of python files, but be careful that the copy is only … (#14982 ) Summary: …done once This allows a no-op build to work correctly even when BUILD_CAFFE2_OPS is on. Pull Request resolved: https://github.com/pytorch/pytorch/pull/14982 Differential Revision: D13413960 Pulled By: zdevito fbshipit-source-id: 6e5412a8c375af8a47c76f548cdd31cff15f3853	2018-12-11 16:54:08 -08:00
Jongsoo Park	0573ef664e	include avx512vl to avx512 code path (#14733 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14733 We often also want to use AVX512VL instruction sets. We already included AVX512F, AVX512DQ. Skylake also has AVX512BW, AVX512CD we may want to later. Reviewed By: duc0 Differential Revision: D13317282 fbshipit-source-id: 82c8e401d82d5c3a5452fb4ccb6e5cb88d242bda	2018-12-05 00:50:51 -08:00
Jongsoo Park	b5181ba1df	add avx512 option (but no avx512 kernel yet) (#14664 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14664 This diff just adds a framework to add avx512 kernels. Please be really really careful about using avx512 kernels unless you're convinced using avx512 will bring good enough overall speedups because it can backfire because of cpu frequency going down. Reviewed By: duc0 Differential Revision: D13281944 fbshipit-source-id: 04fce8619c63f814944b727a99fbd7d35538eac6	2018-12-03 12:18:19 -08:00
Jongsoo Park	5c89190340	inline adagrad functions (#14194 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14194 Inline some of perfkernels/adagrad.h functions for better performance Reviewed By: hyuen Differential Revision: D13096351 fbshipit-source-id: b4da8053278d585eabc5389b8a8dcae0f253b413	2018-12-02 20:23:02 -08:00
Jongsoo Park	c784f847de	fix sparse_adagrad param_size overflow error (#14049 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/14049 param_size should be passed as int64_t Reviewed By: hyuen Differential Revision: D13090511 fbshipit-source-id: 7892d315d7c82c7d7ca103fb36d30cdf1fe24785	2018-11-16 18:53:32 -08:00
Jongsoo Park	4b86a215ca	moving simd adagrad code to perfkernels (#13549 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/13549 caffe2/perfkernels has a nice framework to switch btw implementations optimized for different instructions at runtime. This can be a good preparation to implement avx512 adagrad kernels. Reviewed By: hyuen Differential Revision: D12882872 fbshipit-source-id: a8f0419f6a9fd4e9b864c454dad0a80db267190c	2018-11-11 00:20:39 -08:00
Yangqing Jia	a6f1ae7f20	set up c10 scaffolding. Move macros proper first. Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11939 Reviewed By: orionr, dzhulgakov Differential Revision: D10004629 Pulled By: Yangqing fbshipit-source-id: ba50a96820d35c7922d81c78c4cbe849c85c251c	2018-09-24 11:09:59 -07:00
Christian Puhrsch	a6630e25af	Remove many caffe2::TIndex and replace them with int64_t (#11943 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11943 See title Reviewed By: ezyang Differential Revision: D9992645 fbshipit-source-id: e8f80d6ea762971513e5e8072975ceea53e1f11a	2018-09-22 18:11:04 -07:00
Roy Li	30521a37ad	codemod: caffe::float16 -> at::Half (#11785 ) Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/11785 Replace each instance of float16 with Half. Reviewed By: Yangqing Differential Revision: D9892158 fbshipit-source-id: b9225ca7bd5c84fd1c04a9d24b026c8b6cbff120	2018-09-20 18:55:19 -07:00
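
Two of the Adagrad commits in this log (8cb1f2f9dc and c57f8984e6) describe the update in formulas only: the gradient gains a `weight_decay * w_i` term, dense Adagrad computes `w += lr * grad / (sqrt(moment) + epsilon)`, and the row-wise variant computes `w += lr / (sqrt(moment) + epsilon) * grad` with the factor shared across a row. The following is a minimal pure-Python sketch of those formulas, not the caffe2 kernels. The function names, the default `epsilon`, and the choice to accumulate the mean of squared gradients as the row-wise moment are assumptions made for illustration. Because the update is written as `w += ...`, as in the commit message, a descent step needs a negative `lr`.

```python
import math


def adagrad_step(w, grad, moment, lr, epsilon=1e-5, weight_decay=0.0):
    """One dense Adagrad step over Python lists (hypothetical helper).

    Returns (new_w, new_moment). Follows the dense order from the log:
    w += lr * grad / (sqrt(moment) + epsilon).
    """
    new_w, new_moment = [], []
    for w_i, g_i, m_i in zip(w, grad, moment):
        # L2 regularizer: dL/dw_i = dJ/dw_i + lambda * w_i
        g_i = g_i + weight_decay * w_i
        # Accumulate the squared (regularized) gradient.
        m_i = m_i + g_i * g_i
        # Multiply first, then divide (dense Adagrad order).
        new_w.append(w_i + lr * g_i / (math.sqrt(m_i) + epsilon))
        new_moment.append(m_i)
    return new_w, new_moment


def rowwise_adagrad_step(row, grad, moment, lr, epsilon=1e-5):
    """One row-wise step with a single scalar moment (hypothetical helper).

    Assumption: the moment accumulates the mean of squared gradients
    over the row. Follows the row-wise order from the log:
    w += lr / (sqrt(moment) + epsilon) * grad.
    """
    moment = moment + sum(g * g for g in grad) / len(grad)
    # Divide once; the factor is shared among all elements in the row.
    step = lr / (math.sqrt(moment) + epsilon)
    new_row = [w_i + step * g_i for w_i, g_i in zip(row, grad)]
    return new_row, moment
```

The two orders are algebraically equal but can round differently in floating point, which is why the commits treat the order of the division and multiplication as something to keep consistent between scalar and SIMD code.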

76 Commits