Commit Graph

2086 Commits

Author SHA1 Message Date
PyTorch MergeBot
3b709d7c1e Revert "[Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)"
This reverts commit 015bd0e0a1.

Reverted https://github.com/pytorch/pytorch/pull/116312 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/116312#issuecomment-1869825506))
2023-12-26 23:47:15 +00:00
Yanbo Liang
015bd0e0a1 [Dynamo][10/N] Remove TorchVariable and is_allowed (#116312)
After this refactor:
* ```TorchVariable``` definition and all references are removed.
* All ```is_allowed``` references except one are removed.
  - The only one left is in ```torch/_dynamo/decorators:_disallow_in_graph_helper```. It is called when users put the ```disallow_in_graph``` decorator on a function. Since we use the lists in ```trace_rules``` to decide a function's trace rule, the decorator would only be used on user functions rather than torch functions (see the sketch after this list). I'll defer this to a separate decorator refactor PR.
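A minimal sketch of that decorator usage on a user-defined function, assuming the public `torch._dynamo.disallow_in_graph` entry point:

```python
import torch
import torch._dynamo

# hypothetical user function; disallow_in_graph forces a graph break around it
@torch._dynamo.disallow_in_graph
def my_user_fn(x):
    return x + 1

@torch.compile
def fn(x):
    return my_user_fn(x) * 2  # my_user_fn runs in eager, outside the captured graph

fn(torch.randn(4))
```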

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312
Approved by: https://github.com/jansel
2023-12-23 09:44:09 +00:00
Mikayla Gawarecki
f206e31e2f Swap slots if slots match in swap_tensor (#116128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116128
Approved by: https://github.com/albanD
2023-12-21 00:43:30 +00:00
Kurt Mohler
8a8d0adc0b Fix torch.gradient check for spacing arg list length (#115686)
Fixes #114207
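A small illustration of the spacing-list semantics this check covers (sketch):

```python
import torch

x = torch.tensor([[1., 2., 4.], [2., 4., 8.]])
# one spacing entry per differentiated dimension
d0, d1 = torch.gradient(x, spacing=[2.0, 1.0])
# a spacing list whose length does not match the number of dims is expected to raise
# torch.gradient(x, spacing=[2.0])
```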

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115686
Approved by: https://github.com/albanD
2023-12-13 20:17:20 +00:00
mantaionut
d521857411 Terminate handler (#101332)
Fixes #50051.
This PR is based on #50320 and addresses the last round of feedback.
On Windows it is enabled by default; it can be enabled or disabled via the USE_CUSTOM_TERMINATE env variable.

This PR adds support for overriding the terminate handler in order to log uncaught exceptions in threads.
If an exception is thrown and not caught, it will print <Unhandled exception caught in c10/util/AbortHandler.h>.
The point of doing this is that in issue #50051, exceptions were thrown but not logged; with this logging system it will be easier to debug such failures in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101332
Approved by: https://github.com/albanD, https://github.com/malfet
2023-12-12 17:55:27 +00:00
ecao
65651d970b Optimize the copy of Half to Float and Float to Half on CPU (#103148)
### Description
Optimize the copy of Half to Float and Float to Half on CPU.

### Testing

Single core:
shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (1, 777) | 0.00345 | 0.00344 | 0.00411 | 0.00410
size: (2, 512) | 0.00355 | 0.00344 | 0.00431 | 0.00400
size: (10, 555) | 0.00473 | 0.00391 | 0.00562 | 0.00477
size: (1, 2048, 1024) | 0.488 | 0.480 | 0.498 | 0.499
size: (32, 100, 777) | 0.584 | 0.568 | 0.571 | 0.587

28 cores:
shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (10, 555) |  0.00472 | 0.00369 | 0.00576 |  0.00481
size: (1, 2048, 1024) |  0.0189 | 0.0188 | 0.0173 | 0.0251
size: (64, 512, 1024) | 3.159 | 2.375 |  3.152 | 2.358
size: (32, 100, 777) | 0.0225 | 0.0195 | 0.0193 | 0.0261
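A rough sketch of how numbers like the above can be collected (shapes and iteration counts are illustrative only):

```python
import timeit
import torch

x = torch.randn(32, 100, 777)        # fp32 source
h = x.to(torch.half)                 # fp16 source

t_up = timeit.timeit(lambda: h.to(torch.float32), number=100) / 100
t_down = timeit.timeit(lambda: x.to(torch.half), number=100) / 100
print(f"fp16 -> fp32: {t_up * 1e3:.3f} ms, fp32 -> fp16: {t_down * 1e3:.3f} ms")
```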

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103148
Approved by: https://github.com/jgong5, https://github.com/cpuhrsch
2023-12-12 05:57:52 +00:00
FFFrog
3361496f96 Fix the corner case of index_add (#114929)
Fixes #114864

As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114929
Approved by: https://github.com/mikaylagawarecki
2023-12-09 01:57:25 +00:00
albanD
a2b89154bf New swap function (#111747)
This PR proposes a new approach to solve the problem of nn/optim being linked only by Python object identity.
The idea is to have a function that can swap the contents of two Tensors t1 and t2 while preserving all the old references.
This would allow us to swap `model.weight` with a new Tensor (which can be any subclass of Tensor and any TensorImpl; xla, sparse, and nested TensorImpls would work). The use within nn will be done in a follow-up.

This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs).
Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython versions) or due to slots.

The main limitation of this approach is that the number of slots needs to match for the objects being swapped, which limits the usage of slots in subclasses.

This is a draft right now to see what @colesbury thinks about doing this.
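A minimal sketch of the intended behavior, assuming the function lands as `torch.utils.swap_tensors`:

```python
import torch

t1 = torch.randn(3)
t2 = torch.zeros(3)
ref = t1                      # an old reference that must stay valid

torch.utils.swap_tensors(t1, t2)

assert ref is t1              # object identity is preserved
print(ref)                    # ...but ref now observes the swapped-in content (zeros)
```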

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747
Approved by: https://github.com/colesbury
2023-12-08 18:49:35 +00:00
Kurt Mohler
6f32eb7eef Add decomp for replication_pad2d and use for CUDA deterministic (#111590)
Fixes #95578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-12-01 18:56:09 +00:00
PyTorch MergeBot
013675ff59 Revert "Add decomp for replication_pad2d and use for CUDA deterministic (#111590)"
This reverts commit f1286161a6.

Reverted https://github.com/pytorch/pytorch/pull/111590 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing the XLA job.  The job is also failing on the PR, but the log classifier failed to find the failed test, which led to it being wrongly marked as flaky ([comment](https://github.com/pytorch/pytorch/pull/111590#issuecomment-1833004794))
2023-11-30 02:28:14 +00:00
Kurt Mohler
f1286161a6 Add decomp for replication_pad2d and use for CUDA deterministic (#111590)
Fixes #95578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590
Approved by: https://github.com/peterbell10
2023-11-29 21:50:46 +00:00
PyTorch MergeBot
fe428a284b Revert "Add torch._lazy_clone to create COW tensors (#113397)"
This reverts commit 9916d8a9ea.

Reverted https://github.com/pytorch/pytorch/pull/113397 on behalf of https://github.com/DanilBaibak due to Unfortunately, I need to revert your PR because the lower [PR in the stack](https://github.com/pytorch/pytorch/pull/113396) is failing a bunch of internal build jobs. ([comment](https://github.com/pytorch/pytorch/pull/113397#issuecomment-1818761224))
2023-11-20 10:21:09 +00:00
PyTorch MergeBot
d40d72d664 Revert "Skip test_lazy_clone for Inductor (#114012)"
This reverts commit ecd8d388b9.

Reverted https://github.com/pytorch/pytorch/pull/114012 on behalf of https://github.com/DanilBaibak due to I revert the PR due to the original changes broke the internal build. Here is the original diff stack [D51444337](https://www.internalfb.com/diff/D51444337) ([comment](https://github.com/pytorch/pytorch/pull/114012#issuecomment-1818745425))
2023-11-20 10:12:44 +00:00
Nikita Shulga
ecd8d388b9 Skip test_lazy_clone for Inductor (#114012)
Half of those tests fail if run individually, but the first failure masks all subsequent ones, i.e.
```
PYTORCH_TEST_WITH_INDUCTOR=1 python3 test/test_torch.py -v -k test_lazy_clone_cuda_float32
test_lazy_clone_cuda_float32 (__main__.TestTorchDeviceTypeCUDA) ... FAIL
...
   self.assertTrue(torch._C._is_cow_tensor(t))
AssertionError: False is not true
----------------------------------------------------------------------
Ran 1 test in 19.419s

FAILED (failures=1)
```
But
```
$ PYTORCH_TEST_WITH_INDUCTOR=1 python3 test/test_torch.py -k test_lazy_clone_
...
......................
----------------------------------------------------------------------
Ran 24 tests in 24.969s

OK
```
This flaky behavior was already detected, for example see https://github.com/pytorch/pytorch/issues/113953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114012
Approved by: https://github.com/huydhn, https://github.com/kit1980
2023-11-18 04:57:00 +00:00
Kurt Mohler
9916d8a9ea Add torch._lazy_clone to create COW tensors (#113397)
Part of #109833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397
Approved by: https://github.com/ezyang
ghstack dependencies: #113396
2023-11-17 01:58:51 +00:00
Brian Hirsh
cebad9867b graph break on intermediate leaves that require grad (#113277)
Fixes https://github.com/pytorch/pytorch/issues/90552. This is a simpler fix that just detects the situation where AOTAutograd can't create a proper backward graph and graph breaks. This was technically a silent correctness issue before.

This PR tries to always graph break when we see a factory function that returns a tensor requiring grad. I check this by seeing if the op returned a `TensorVariable` in dynamo, and if one of the input arguments was a `requires_grad=True` kwarg. I think this is high-fidelity enough, and I'm also hoping that this is uncommon enough that a graph break is reasonable here.

The fix to avoid the graph break in user land is also pretty easy - just instantiate your tensor outside of the compiled region and plumb it in.
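A sketch of that user-land workaround:

```python
import torch

# graph-breaks: a factory op with requires_grad=True inside the compiled region
@torch.compile
def breaks(x):
    w = torch.ones(3, requires_grad=True)
    return x * w

# workaround: create the leaf outside the compiled region and plumb it in
w = torch.ones(3, requires_grad=True)

@torch.compile
def ok(x, w):
    return x * w

ok(torch.randn(3), w)
```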

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113277
Approved by: https://github.com/eellison
ghstack dependencies: #113267, #113416, #113584
2023-11-16 02:47:45 +00:00
Nikita Shulga
78f3937ee8 [BE] Handle errors in set_num_threads (#113684)
and `set_num_interop_threads`.

Before this change, calling `torch.set_num_threads(2**65)` resulted in a segmentation fault; afterwards it becomes a good old runtime error:
```
% python -c "import torch;torch.set_num_threads(2**65)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: Overflow when unpacking long
```

Similar to https://github.com/pytorch/pytorch/pull/60073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113684
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-11-15 06:17:41 +00:00
Kurt Mohler
8bdce9bb74 Fix UntypedStorage.resize_ to keep same CUDA device index (#113386)
Fixes #113300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113386
Approved by: https://github.com/albanD
2023-11-10 01:57:25 +00:00
Kurt Mohler
fd209543d5 Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377)
Part of #109802
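A usage sketch of the flag (assuming it defaults to True, per the deterministic-fill work it controls):

```python
import torch

torch.use_deterministic_algorithms(True)
torch.utils.deterministic.fill_uninitialized_memory = True  # opt in/out of the fill

x = torch.empty(4)                     # filled with NaN in deterministic mode
y = torch.empty(4, dtype=torch.int64)  # filled with the max integer value
```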

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377
Approved by: https://github.com/albanD, https://github.com/aaronenyeshi
2023-11-01 16:10:09 +00:00
PyTorch MergeBot
ace2713d1e Revert "Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377)"
This reverts commit f1785373c0.

Reverted https://github.com/pytorch/pytorch/pull/111377 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111377#issuecomment-1784179040))
2023-10-29 17:41:55 +00:00
Nikita Shulga
b61efe1c2b Fix `torch.[size|stride](dim=None)` invocation (#111991)
Per the documentation, one should be able to explicitly pass the dim argument as None to get the tensor sizes/strides across all dimensions, but before this change it was incorrectly interpreted as a named tensor call.

Modify the `size` and `stride` signatures generated by `gen_pyi.py` to highlight that the overload with `None` returns a Tuple, while the one with `dim: _int` returns an `int`.

Add a regression test to validate the behavior, and remove the assert checks from two named tensor tests (NamedTensors are dead, aren't they?)
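A quick illustration of the fixed overloads:

```python
import torch

t = torch.empty(2, 3)
t.size(dim=None)    # torch.Size([2, 3]) -- all dimensions, no longer a named-tensor call
t.size(0)           # 2, a plain int
t.stride(dim=None)  # (3, 1)
```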

Fixes https://github.com/pytorch/pytorch/issues/111944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111991
Approved by: https://github.com/zou3519
2023-10-26 04:14:35 +00:00
Kurt Mohler
f1785373c0 Add torch.utils.deterministic.fill_uninitialized_memory flag (#111377)
Part of #109802

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377
Approved by: https://github.com/albanD
2023-10-26 02:39:06 +00:00
Nikita Shulga
7709382b50 Fix regression in torch.equal behavior for NaNs (#111699)
`torch.equal(x, x)` should return False if `x` is a tensor of floats, one of which is NaN.
This renders some of the optimizations proposed in https://github.com/pytorch/pytorch/pull/100024 invalid, though as a result `torch.equal` becomes much slower for identical floating point tensors.

Add a regression test that calls torch.equal on a tensor containing NaN

Fixes https://github.com/pytorch/pytorch/issues/111251
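For illustration:

```python
import torch

x = torch.tensor([1.0, float("nan")])
torch.equal(x, x)   # False: NaN != NaN, even when comparing a tensor with itself
```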

Pull Request resolved: https://github.com/pytorch/pytorch/pull/111699
Approved by: https://github.com/Skylion007, https://github.com/albanD
2023-10-21 00:02:45 +00:00
CaoE
d1afb7d43d add Half support for multinomial on CPU (#104178)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104178
Approved by: https://github.com/jgong5, https://github.com/kulinseth, https://github.com/cpuhrsch
2023-10-20 19:16:04 +00:00
Evgeni Burovski
48989bc820 trace frames with np.ndarray (#110512)
Fixes #109604

Resubmit gh-109715 + several skips and small fixes to make tests pass.

The main fix here is by @ysiraichi: previously, dynamo did not resume tracing numpy ndarrays after a graph break.
While at it, fix several small issues uncovered by Yukio's fix:

- graph break gracefully on numpy dtypes which do not map to torch.dtypes (uint16 etc)
- recognize array scalars in dynamo, treat them as 0D ndarrays
- make sure that iterating over torch.ndarray generates arrays, not bare tensors
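A minimal sketch of a frame that should now trace (illustrative only):

```python
import numpy as np
import torch

@torch.compile(fullgraph=False)
def fn(x: np.ndarray):
    y = np.sin(x) ** 2        # traced through torch._numpy
    return y + x.mean()

fn(np.arange(8, dtype=np.float32))
```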

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110512
Approved by: https://github.com/lezcano
2023-10-15 00:56:10 +00:00
CaoE
8713a1a363 add Half support for bernoulli on CPU (#104176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104176
Approved by: https://github.com/mingfeima, https://github.com/cpuhrsch
2023-10-13 01:18:55 +00:00
Prachi Gupta
53a9ac534c Added decorator skipRocmIfTorchInductor and skipped failing tests (#107760)
This PR adds a skip decorator which will disable tests in CI for the ROCm inductor workflow. This new workflow will be coming in via https://github.com/pytorch/pytorch/pull/110544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107760
Approved by: https://github.com/jataylo, https://github.com/pruthvistony, https://github.com/atalman
2023-10-12 16:00:35 +00:00
Elias Ellison
cf1da9bd17 enable index add test (#111016)
Dynamo swallows a user exception when suppress_errors is set to True. There's an issue filed for that: https://github.com/pytorch/pytorch/issues/108798. In the meantime, we still want the functionality in this test, which works under the default setting (don't suppress errors), not to regress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111016
Approved by: https://github.com/yanboliang
2023-10-11 19:41:35 +00:00
eellison
fb4b9e9c8e Re-enable a couple of fixed tests (#110770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110770
Approved by: https://github.com/yanboliang, https://github.com/int3, https://github.com/Skylion007
ghstack dependencies: #110651
2023-10-10 19:13:14 +00:00
eellison
c5f06b9753 Re-enable test_copy_transpose_math_view, neg_view/dce fix (#110651)
- neg view can just be lowered to neg() post functionalization
- we were treating all fallback kernels as not having side effects. We shouldn't DCE mutating fallback kernels - either mutations induced by the reinplacing pass or clone_ with unsupported arguments (complex)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110651
Approved by: https://github.com/Chillee, https://github.com/jansel, https://github.com/malfet, https://github.com/Skylion007
2023-10-10 16:34:01 +00:00
jjsjann123
37567fdf31 Nvfuser cpp api deprecation attempt 2 (#110881)
Attempting to re-try #110318, deprecating the nvfuser C++ API.

The warning has been updated to TORCH_WARN_ONCE;
the warning thrown inside torch::jit::fuser::cuda::isEnabled() is turned off and will be deprecated when we pull out the TorchScript integration in the follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110881
Approved by: https://github.com/davidberard98, https://github.com/NicolasHug
2023-10-10 08:07:03 +00:00
PyTorch MergeBot
bbdc8c7b05 Revert "deprecating nvfuser c++ API (#110318)"
This reverts commit bf0866fc16.

Reverted https://github.com/pytorch/pytorch/pull/110318 on behalf of https://github.com/davidberard98 due to too many warnings being thrown in torchvision https://github.com/pytorch/pytorch/issues/110857 ([comment](https://github.com/pytorch/pytorch/pull/110318#issuecomment-1753245449))
2023-10-09 15:41:50 +00:00
jjsjann123
bf0866fc16 deprecating nvfuser c++ API (#110318)
deprecating nvfuser c++ API

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110318
Approved by: https://github.com/davidberard98
2023-10-07 02:25:21 +00:00
eellison
3812f2e40c Preserve layout on like constructors (#110242)
Partially fixes `test_memory_format_factory_like_functions_preserve` with PYTORCH_TEST_WITH_INDUCTOR. Inductor preserves memory layouts for user-visible outputs as annotated on the fx graph that it is passed in. That graph is generated from running aot_autograd with decompositions. If the decompositions give incorrect strides, so will inductor.

This preserves the layout of `_like` operators when it corresponds to a `torch.memory_format`. It doesn't fix a) arbitrary permutations, b) striding of non-dense outputs. Both of these are lower-pri compared to preserving channels last. We would need either https://github.com/pytorch/pytorch/issues/92920 or a `to` variant that takes in a physical layout for arbitrary permutations. I converted the output of rand to the correct layout instead of passing the layout in so that this would compose with the `replace_random` pass, and because the two pointwise ops will get fused anyway.
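A sketch of the behavior this targets (channels-last preserved through a `_like` constructor under compile):

```python
import torch

x = torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)

@torch.compile
def f(x):
    return torch.rand_like(x) + 1

out = f(x)
print(out.is_contiguous(memory_format=torch.channels_last))  # expected: True
```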

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110242
Approved by: https://github.com/int3
2023-10-02 23:53:55 +00:00
Moritz Hennen
09c598745c Rename torch._C._TensorBase to TensorBase (#109940)
I have gone ahead and implemented the renaming of the type `torch._C._TensorBase` to a non-private class name `TensorBase`.
The changes also include leaving `torch._C._TensorBase` as an alias to the new type: 70458768fb/torch/csrc/autograd/python_variable.cpp (L2196-L2197) both in the c++ code and in the corresponding `__init__.pyi.in` file:
70458768fb/torch/_C/__init__.pyi.in (L1522)
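A small check of the rename and alias, assuming both names are exposed as described:

```python
import torch

# the new public name and the old private alias refer to the same type
assert torch._C.TensorBase is torch._C._TensorBase
assert isinstance(torch.ones(1), torch._C.TensorBase)
```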

Fixes #109438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109940
Approved by: https://github.com/ezyang
2023-09-25 19:10:22 +00:00
Jez Ng
063a62622b Add memory overlap check to meta_copy_ (#108989)
Fixes `test_copy_many_to_one`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108989
Approved by: https://github.com/eellison
2023-09-12 23:28:14 +00:00
Kurt Mohler
4c5e43574c Reland 2: Add PyObject preservation for UntypedStorage (#109039)
Relands #103907 after it was reverted. This PR makes the new `ignore_hermetic_tls` argument of `check_pyobj` optional to avoid causing a compilation error in torchdistx

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109039
Approved by: https://github.com/ezyang
2023-09-12 22:26:05 +00:00
PyTorch MergeBot
41bd0fde7e Revert "Remove fixed skips (#108674)"
This reverts commit ab9fb03d6f.

Reverted https://github.com/pytorch/pytorch/pull/108674 on behalf of https://github.com/huydhn due to Sorry for picking this up a bit late, but with https://github.com/pytorch/pytorch/pull/108647 reverted, these tests are failing again. So we need to wait for the PR to reland before we can land this change ([comment](https://github.com/pytorch/pytorch/pull/108674#issuecomment-1715202692))
2023-09-12 08:04:32 +00:00
PyTorch MergeBot
59f605be57 Revert "Reland 2: Add PyObject preservation for UntypedStorage (#109039)"
This reverts commit 419e4e17a2.

Reverted https://github.com/pytorch/pytorch/pull/109039 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing linter job in trunk, probably due to a landrace ([comment](https://github.com/pytorch/pytorch/pull/109039#issuecomment-1715147020))
2023-09-12 07:26:11 +00:00
Kurt Mohler
419e4e17a2 Reland 2: Add PyObject preservation for UntypedStorage (#109039)
Relands #103907 after it was reverted. This PR makes the new `ignore_hermetic_tls` argument of `check_pyobj` optional to avoid causing a compilation error in torchdistx

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109039
Approved by: https://github.com/ezyang
2023-09-12 01:19:40 +00:00
Li-Huai (Allan) Lin
b2cba439b4 Introduce Tensor overload to linspace and logspace (#104889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889
Approved by: https://github.com/zou3519
ghstack dependencies: #107958
2023-09-11 23:30:40 +00:00
PyTorch MergeBot
a7f5abeade Revert "Introduce Tensor overload to linspace and logspace (#104889)"
This reverts commit 57e5239321.

Reverted https://github.com/pytorch/pytorch/pull/104889 on behalf of https://github.com/clee2000 due to sorry have to revert this to revert https://github.com/pytorch/pytorch/pull/107958 ([comment](https://github.com/pytorch/pytorch/pull/104889#issuecomment-1714305768))
2023-09-11 17:33:48 +00:00
Li-Huai (Allan) Lin
57e5239321 Introduce Tensor overload to linspace and logspace (#104889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104889
Approved by: https://github.com/zou3519
ghstack dependencies: #107958
2023-09-11 15:29:39 +00:00
Edward Z. Yang
137afe74e0 Don't fastpath conj copy when conj/neg bit mismatch (#108881)
Fixes https://github.com/pytorch/pytorch/issues/106051

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108881
Approved by: https://github.com/soulitzer
2023-09-08 20:44:43 +00:00
PyTorch MergeBot
68238606f3 Revert "Reland: Add PyObject preservation for UntypedStorage (#103907)"
This reverts commit 56b848157c.

Reverted https://github.com/pytorch/pytorch/pull/103907 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing torchdistx build which uses check_pyobj here 9c1b9f5cb2/src/python/torchdistx/_C/deferred_init.cc (L87) ([comment](https://github.com/pytorch/pytorch/pull/103907#issuecomment-1712121158))
2023-09-08 19:27:07 +00:00
Evgeni Burovski
1f20531939 fall back to eager on NotImplementedError (#107863)
Follow-up to https://github.com/pytorch/pytorch/pull/107710:

Help dynamo fall back to eager when compiling unimplemented numpy constructs:

- arrays of strings
- (arg){min, max} for complex types
- various arguments typed as NotImplemented (`np.ones(4, order="F")` etc)
- numpy functions which torch._numpy does not implement

To test, run (we do not implement arrays of strings)

```
import torch
import numpy as np

@torch.compile(fullgraph=False)
def fn():
    return np.asarray(["L", "U"])
```

and observe it compiles with fullgraph=False and fails with fullgraph=True

Fixes https://github.com/pytorch/pytorch/issues/107970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107863
Approved by: https://github.com/ezyang, https://github.com/lezcano
2023-09-07 21:22:20 +00:00
eellison
ab9fb03d6f Remove fixed skips (#108674)
These no longer fail with TEST_WITH_TORCHINDUCTOR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108674
Approved by: https://github.com/desertfire
2023-09-07 17:36:56 +00:00
Kurt Mohler
56b848157c Reland: Add PyObject preservation for UntypedStorage (#103907)
This relands #97470 after #102553 reverted it. This PR attempts to fix the internal failure by avoiding an unnecessary intermediate storage buffer allocation in `c10::newStorageImplFromRefcountedDataPtr`.

Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103907
Approved by: https://github.com/ezyang
2023-09-07 04:24:11 +00:00
Kurt Mohler
3f88e3105f Reland: Remove remaining global set_default_dtype calls from tests (#108088)
Fixes #68972

Relands #107246

To avoid causing Meta-internal CI failures, this PR avoids always asserting that the default dtype is float in the `TestCase.setUp/tearDown` methods. Instead, the assert is only done if `TestCase._default_dtype_check_enabled == True`. `_default_dtype_check_enabled` is set to True in the `if __name__ == "__main__":` blocks of all the relevant test files that have required changes for this issue

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108088
Approved by: https://github.com/ezyang
2023-09-07 03:04:34 +00:00
PyTorch MergeBot
43527d41a2 Revert "Remove fixed skips (#108674)"
This reverts commit 518cfda2dd.

Reverted https://github.com/pytorch/pytorch/pull/108674 on behalf of https://github.com/huydhn due to Sorry for reverting this, but one test is failing on inductor 518cfda2dd, and it seems easier to revert this than disabling the test ([comment](https://github.com/pytorch/pytorch/pull/108674#issuecomment-1709310192))
2023-09-07 00:56:46 +00:00
eellison
518cfda2dd Remove fixed skips (#108674)
These no longer fail with TEST_WITH_TORCHINDUCTOR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108674
Approved by: https://github.com/desertfire
2023-09-06 22:33:43 +00:00
PyTorch MergeBot
161ea463e6 Revert "Remove remaining global set_default_dtype calls from tests (#107246)"
This reverts commit aa8ea1d787.

Reverted https://github.com/pytorch/pytorch/pull/107246 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/107246#issuecomment-1693838522))
2023-08-25 19:34:55 +00:00
Digant Desai
8a7a6867b9 [PyTorch][Tensor] Introduce tensor.dim_order (#106835)
Summary:
This is a stride-based attribute for a tensor, available in Python.

It can help inspect tensors generated using `torch.empty_permuted(.., physical_layout, ...)`, where physical_layout should match the dim_order returned here. `empty_permuted` will be renamed to use dim_order as the param name in the future. It will also help the ExecuTorch export pipeline with implementing dim_order-based tensors.
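A small sketch of the relationship between `empty_permuted`'s physical_layout and `dim_order`:

```python
import torch

# NCHW tensor laid out channels-last in memory
x = torch.empty_permuted((2, 3, 4, 5), (0, 2, 3, 1))
print(x.dim_order())   # expected: (0, 2, 3, 1), matching the physical_layout above
```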

Differential Revision: D48134476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106835
Approved by: https://github.com/ezyang
2023-08-25 00:06:03 +00:00
Kurt Mohler
aa8ea1d787 Remove remaining global set_default_dtype calls from tests (#107246)
Fixes #68972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107246
Approved by: https://github.com/ezyang
2023-08-24 16:10:48 +00:00
Aaron Gokaslan
660e8060ad [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I'm enabling it so that it stays that way. :)
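For reference, the pattern RUF017 flags and a linear-time alternative (illustrative):

```python
import itertools

list_of_lists = [[1, 2], [3], [4, 5]]

flat_quadratic = sum(list_of_lists, [])                            # what RUF017 flags: O(n^2)
flat_linear = list(itertools.chain.from_iterable(list_of_lists))   # preferred
```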

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-22 23:16:38 +00:00
PyTorch MergeBot
d59a6864fb Revert "[BE]: Update ruff to 0.285 (#107519)"
This reverts commit 88ab3e4322.

Reverted https://github.com/pytorch/pytorch/pull/107519 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR breaks internal tests. @ezyang, can you please hep them get unblocked? It seems like one of the strings was prob accidentally modified ([comment](https://github.com/pytorch/pytorch/pull/107519#issuecomment-1688833480))
2023-08-22 19:53:32 +00:00
Aaron Gokaslan
88ab3e4322 [BE]: Update ruff to 0.285 (#107519)
This updates ruff to 0.285, which is faster, better, and fixes a bunch of false negatives with regard to f-strings.

I also enabled RUF017, which looks for accidental quadratic list summation. Luckily, it seems there are no instances of it in our codebase, so I'm enabling it so that it stays that way. :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107519
Approved by: https://github.com/ezyang
2023-08-20 01:36:18 +00:00
Catherine Lee
bc053070f8 Mark test_gradient_extreme_cases as slow for inductor (#107189)
test_gradient_extreme_cases_* takes ~5 minutes on the inductor sm86 shard and possibly even longer on the inductor workflow, since it's timing out right now (although I'm not sure what the difference between the two is), and sometimes auto slow-test detection isn't catching it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107189
Approved by: https://github.com/ZainRizvi
2023-08-15 22:03:00 +00:00
Sam Larsen
3d00170b20 [inductor] fix test_dim_function_empty (#106994)
Summary: Looks like the assert syntax was just wrong

Test Plan:
PYTORCH_TEST_WITH_INDUCTOR=1 python test/test_torch.py -k test_dim_function_empty
PYTORCH_TEST_WITH_AOT_EAGER=1 python test/test_torch.py -k test_dim_function_empty
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106994
Approved by: https://github.com/eellison
2023-08-11 21:38:53 +00:00
Kshiteej K
a899333ffc fix: nll_loss batch rule with negative ignore_idx (#106118)
For batching rules, we use the Python decompositions instead of writing our own.

Fixes https://github.com/pytorch/pytorch/issues/105736
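A sketch of the case being fixed (vmapping nll_loss with a negative ignore_index; shapes are illustrative):

```python
import torch
import torch.nn.functional as F
from torch.func import vmap

logits = torch.randn(4, 3, 5).log_softmax(-1)   # batch of (N=3, C=5) inputs
targets = torch.randint(-1, 5, (4, 3))          # -1 entries should be ignored

loss = vmap(lambda x, t: F.nll_loss(x, t, ignore_index=-1, reduction="sum"))(logits, targets)
```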

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106118
Approved by: https://github.com/lezcano, https://github.com/zou3519
2023-08-04 07:43:02 +00:00
Fuzzkatt
ae1c0f42a3 update tf32 thresholds for H100 (#105879)
Addresses tf32-threshold-related failures from NVIDIA internal testing for the following unit tests:

H100:
- test_nn.py: test_ConvTranspose2d_dilated_cuda_tf32, test_ConvTranspose2d_no_bias_cuda_tf32, test_Transformer_multilayer_coder_cuda_tf32
- test_torch.py: test_cdist_non_contiguous_batch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105879
Approved by: https://github.com/ezyang
2023-08-02 16:44:01 +00:00
Scott Wolchok
b435bff53a [PyTorch] Add tests for empty tensors w/storage null data_ptr (#101426)
Further investigation seems to show that changing this behavior (making empty tensors sometimes have non-null data_ptr) was the real problem with #98090 . Adding tests to lock down this behavior so we don't change it by accident again.

Differential Revision: [D45873002](https://our.internmc.facebook.com/intern/diff/D45873002/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101426
Approved by: https://github.com/zou3519
2023-07-27 05:19:42 +00:00
Nikita Karetnikov
eac9e1b35f [OpInfo] add reference and error inputs for multilabel_margin_loss (#105523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105523
Approved by: https://github.com/ezyang
2023-07-23 02:16:29 +00:00
Justin Chu
4cc1745b13 [BE] f-stringify torch/ and scripts (#105538)
This PR is a follow up on the pyupgrade series to convert more strings to use f-strings using `flynt`.

- https://docs.python.org/3/reference/lexical_analysis.html#f-strings
- https://pypi.org/project/flynt/

Command used:

```
flynt torch/ -ll 120
flynt scripts/ -ll 120
flynt tools/ -ll 120
```

and excluded `collect_env.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105538
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-07-21 19:35:24 +00:00
Justin Chu
73e1455327 [BE] Enable ruff's UP rules and autoformat test/ (#105434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105434
Approved by: https://github.com/albanD
2023-07-19 20:36:06 +00:00
Kurt Mohler
fcb7d4b358 Mark bincount CUDA deterministic if weights are not given (#105244)
Fixes #98316
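A usage sketch:

```python
import torch

torch.use_deterministic_algorithms(True)
x = torch.randint(0, 10, (1000,), device="cuda")

torch.bincount(x)  # deterministic now that no weights are given
# torch.bincount(x, weights=torch.rand(1000, device="cuda"))  # weighted case still errors in deterministic mode
```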

Pull Request resolved: https://github.com/pytorch/pytorch/pull/105244
Approved by: https://github.com/mikaylagawarecki
2023-07-18 01:16:51 +00:00
Nikita Shulga
5837e95d30 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

That were reverted due to the conflict with internal source repo.

Mostly fixes for PEP-484 violations (i.e. when a default arg is set to None, but the type is not annotated as Optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert in `torch/optim/optimizer.py` that the Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`

Unrelated, to bypass CI failures due to the gcc9 dependency update in Ubuntu-18.04:
- Add a hack to squash the older libstdc++ from the conda environment in favor of the one from the OS to `.ci/docker/install_conda.sh`
- Update bazel cuda builds to focal, as with libstdc++-6.0.32 bazel builds lose the ability to catch exceptions (probably because they link with cupti statically, but I could not find where it is done)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-15 20:30:20 +00:00
PyTorch MergeBot
15fd1ea118 Revert "[Reland] Update mypy to 1.4.1 (#105227)"
This reverts commit c9c4f8efc3.

Reverted https://github.com/pytorch/pytorch/pull/105227 on behalf of https://github.com/atalman due to trying to mitigate ci sev #105248 ([comment](https://github.com/pytorch/pytorch/pull/105227#issuecomment-1636510935))
2023-07-14 22:28:35 +00:00
Nikita Karetnikov
0c89596e4f [OpInfo] add reference and error inputs for multi_margin_loss (#104850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104850
Approved by: https://github.com/ezyang
2023-07-14 21:16:09 +00:00
Nikita Shulga
c9c4f8efc3 [Reland] Update mypy to 1.4.1 (#105227)
This PR re-lands
- [Typing] Fix PEP 484 Violation (#105022)
- Update mypy to 1.4.1 (#91983)

That were reverted due to the conflict with internal source repo.

Mostly fixes for PEP-484 violations (i.e. when a default arg is set to None, but the type is not annotated as Optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
  - Add assert in `torch/optim/optimizer.py` that the Optional list is not None
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105227
Approved by: https://github.com/atalman, https://github.com/albanD, https://github.com/Skylion007
2023-07-14 20:45:12 +00:00
PyTorch MergeBot
3c5a494d7a Revert "Update mypy to 1.4.1 (#91983)"
This reverts commit 634659e262.

Reverted https://github.com/pytorch/pytorch/pull/91983 on behalf of https://github.com/malfet due to Its dependent change was reverted, so reverting this one as well, to keep CI clean ([comment](https://github.com/pytorch/pytorch/pull/91983#issuecomment-1636059709))
2023-07-14 15:59:16 +00:00
Kurt Mohler
f987d11fa7 Reland: Make torch.empty* deterministic by filling with NaN or max int (#104995)
Relands #101849 after #104302 reverted it.

torchrec PR https://github.com/pytorch/torchrec/pull/1269 fixes the torchrec failure that caused #101849 to be reverted

Part of #82004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104995
Approved by: https://github.com/albanD
2023-07-13 22:18:03 +00:00
Nikita Shulga
634659e262 Update mypy to 1.4.1 (#91983)
Mostly fixes for PEP-484 violations (i.e. when a default arg is set to None, but the type is not annotated as Optional)
Plus a few real fixes:
  - Add missing `_get_upgraders_entry_map` to `torch/_C/__init__.pyi`
  - Add missing return statement to `torch._export.deserialize_graph`
  - Fix error message in `torch.ao.ns.fx.weight_utils.get_lstm_mod_weights`
TODO (in followup PR):
  - Fix erroneous `isinstance` check in `torch/ao/quantization/_pt2e/qat_utils.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/91983
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/thiagocrepaldi, https://github.com/aaronenyeshi
2023-07-13 16:30:36 +00:00
yanbing-j
053654b9cf Optimize scatter_add/scatter_reduce in BFloat16/Half data type in CPU backend (#103427)
### Description

This PR optimizes scatter_add/scatter_reduce for the BFloat16/Half data types in the CPU backend, which is one task in https://github.com/pyg-team/pytorch_geometric/issues/7057. The main point is creating a buffer among threads to accumulate intermediate data as fp32.

Next step:

 - [x] Add benchmarks
 - [x] Extend to Half
 - [x] Simplify code

### Performance test (Updated)

Test BFloat16 in Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz
With jemalloc and iomp

Single socket (40C)
![image](https://github.com/pytorch/pytorch/assets/61222868/4b4342f1-8cc3-46f7-81f5-651becd9b1e3)

Single core
![image](https://github.com/pytorch/pytorch/assets/61222868/09e5f700-2c2e-4208-979e-74b85474dea6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103427
Approved by: https://github.com/mingfeima, https://github.com/albanD
2023-07-13 09:34:29 +00:00
Aaron Gokaslan
2f95a3d0fc [BE]: Apply ruff PERF fixes to torch (#104917)
Applies automated ruff fixes in the PERF modules and enables all automatic ones. I also updated ruff which applied some additional fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104917
Approved by: https://github.com/ezyang, https://github.com/albanD
2023-07-11 20:45:21 +00:00
Kurt Mohler
0ccdbbe233 Add deterministic path for Tensor.resize_ (#104300)
New elements added to a tensor by `torch.Tensor.resize_` are set to NaN/MAX_INT when deterministic mode is turned on.

When `torch.Tensor.resize_` is called on a quantized tensor and deterministic mode is turned on, a nondeterministic error is raised.

Part of #82004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104300
Approved by: https://github.com/albanD
2023-07-07 00:22:13 +00:00
Nikita Shulga
ddd7da7546 Enable more tests (#104437)
Remove `test_segment_reductions` from the list of blocklisted tests. Remove the `@onlyCPU` qualifier from test_segment_reductions as it has CUDA-specific parts.

Fixes https://github.com/pytorch/pytorch/issues/104410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104437
Approved by: https://github.com/atalman, https://github.com/huydhn
2023-06-30 16:26:11 +00:00
Amr Elshennawy
a78bddac01 Revert D46920584: Multisect successfully blamed D46920584 for test or build failures (#104269) (#104302)
Summary:

This diff is reverting D46920584
D46920584: Make `torch.empty*` deterministic by filling with NaN or max int value (#101849) by generatedunixname499836121 has been identified to be causing the following test or build failures:

Tests affected:
- [torchrec/distributed/composable/tests:test_fsdp - torchrec.distributed.composable.tests.test_fsdp.FullyShardTest: test_composable_checkpoint](https://www.internalfb.com/intern/test/281475062923125/)

Here's the Multisect link:
https://www.internalfb.com/multisect/2341386
Here are the tasks that are relevant to this breakage:

We're generating a revert to back out the changes in this diff, please note the backout may land if someone accepts it.

If you believe this diff has been generated in error you may Commandeer and Abandon it.

Test Plan: NA

Reviewed By: huydhn, osalpekar

Differential Revision: D46997394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104302
Approved by: https://github.com/osalpekar
2023-06-29 20:20:58 +00:00
Richard Barnes
8cad411d3d Fix UntypedStorage pin error (#104355)
Summary:
Fixes:
```
TypeError: cannot pin 'torch.storage.UntypedStorage' only CPU memory can be pinned
```

Test Plan: Sandcastle

Differential Revision: D47093797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104355
Approved by: https://github.com/malfet
2023-06-29 16:06:52 +00:00
Kurt Mohler
2642f31e4c Make torch.empty* deterministic by filling with NaN or max int value (#101849)
Part of #82004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101849
Approved by: https://github.com/lezcano, https://github.com/albanD, https://github.com/kulinseth
2023-06-21 02:53:22 +00:00
Elias Ellison
40d70ba7ed Remove a number of fixed skips (#103162)
Also adds `PYTORCH_TEST_WITH_AOT_EAGER` to distinguish errors coming from aot_autograd and not inductor (not tested in ci, but useful for local debugging)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103162
Approved by: https://github.com/desertfire
2023-06-08 17:37:59 +00:00
ts
d2d03f0f44 Make index_add_ error if input source shape is wrong (#100321)
Fixes #92576, checking the following as described in the documentation:

"source.shape[dim] == len(index) and source.shape[i] == self.shape[i] for i != dim"

I would be happy to iterate on this if there are any issues, and to implement the same check for the CUDA and MPS implementations of index_add_.
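An illustration of the documented shape contract (sketch):

```python
import torch

self_t = torch.zeros(5, 3)
index = torch.tensor([0, 2])
source = torch.ones(2, 3)      # source.shape[0] == len(index); other dims match self_t

self_t.index_add_(0, index, source)              # ok
# self_t.index_add_(0, index, torch.ones(4, 3))  # wrong source shape: now raises
```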
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100321
Approved by: https://github.com/lezcano
2023-06-08 06:51:10 +00:00
Lu Fang
1237502213 Introduce fast path for cuda_equal (#102714)
We introduce the same trick for cuda_equal, assuming that in cuda_equal the flags are already handled correctly.

Added tests for the CUDA part.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/102714
Approved by: https://github.com/ezyang
2023-06-03 05:49:49 +00:00
Shiyan Deng
685505353a Back out "Add PyObject preservation for UntypedStorage (#97470)" (#102553)
Summary:
Original commit changeset: c24708d18ccb

Original Phabricator Diff: D46159983

Test Plan: SL tests and CI

Differential Revision: D46284986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102553
Approved by: https://github.com/DanilBaibak
2023-06-01 17:23:43 +00:00
Edward Z. Yang
818d92f58c Support resize on meta storage (#101988)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101988
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-05-25 04:41:45 +00:00
PyTorch MergeBot
210fc28d5e Revert "Support resize on meta storage (#101988)"
This reverts commit 7d1ba0a92a.

Reverted https://github.com/pytorch/pytorch/pull/101988 on behalf of https://github.com/osalpekar due to Need to revert and rebase this in order to unblock train import ([comment](https://github.com/pytorch/pytorch/pull/101988#issuecomment-1561970230))
2023-05-24 21:51:33 +00:00
Wang, Eikan
2e18dd2bdc Improve bf16 neg by bypassing the conversion between BF16 and FP32 (#99711)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99711
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/desertfire
2023-05-24 03:25:23 +00:00
Kazuaki Ishizaki
be5e77ca4c Make _StorageBase.byteswap faster ( > 10000x) (#101925)
This PR addresses #101690. It implements faster swapping of data elements in `_StorageBase` using C++ rather than Python.

This PR helps the situation where a large model saved on a little-endian machine is loaded on a big-endian machine.

TODO:
- [x] Add test cases
- [x] Add performance comparison before and after the PR
- [ ] (Optional) Investigate further opportunities for performance improvements by [SIMDization](https://dev.to/wunk/fast-array-reversal-with-simd-j3p)

Fixes #101690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101925
Approved by: https://github.com/mikaylagawarecki
2023-05-24 00:13:41 +00:00
Edward Z. Yang
7d1ba0a92a Support resize on meta storage (#101988)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101988
Approved by: https://github.com/albanD, https://github.com/bdhirsh
2023-05-23 16:49:17 +00:00
Kurt Mohler
5fe629e314 Add PyObject preservation for UntypedStorage (#97470)
Part of #91395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97470
Approved by: https://github.com/ezyang
2023-05-23 01:27:30 +00:00
drisspg
6f13d6892a Add meta support for multinomial (#101324)
# Summary
Found this when trying to compile the text gen loop of nanogpt here: b33289942b/torchbenchmark/models/nanogpt_generate/model.py (L322)
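A minimal sketch of the meta path:

```python
import torch

probs = torch.rand(8, device="meta")
samples = torch.multinomial(probs, 4)   # shape inference only, no real sampling
print(samples.shape, samples.device)    # torch.Size([4]) meta
```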

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101324
Approved by: https://github.com/ngimel
2023-05-19 00:04:26 +00:00
Edward Z. Yang
c567748e16 Make interpolate_bilinear deterministic using decomposition (#101115)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101115
Approved by: https://github.com/ngimel
2023-05-11 22:48:01 +00:00
Yu, Guangye
14964b3aa5 Add is_xpu to torch type (#101072)
# Motivation
Without this PR:
```python
>>>import torch
>>>torch.IntTensor.is_cuda
False
>>>torch.IntTensor.is_xpu
<attribute 'is_xpu' of 'torch._C._TensorBase' objects>
```

With this PR:
```python
>>>import torch
>>>torch.IntTensor.is_xpu
False
```
To align with CUDA: some customer code uses is_xpu to check the backend. Without this PR, the check is always truthy, which results in unexpected behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101072
Approved by: https://github.com/mikaylagawarecki
2023-05-11 17:50:59 +00:00
vfdev-5
622e582a2b Register get_cpu_capability for jit (#100723)
Description:

Context: In torchvision we ensure that functional ops are torchscriptable. The recently exposed `torch.backends.cpu.get_cpu_capability()` from https://github.com/pytorch/pytorch/pull/100164 is failing in torchvision CI
```
RuntimeError:
Python builtin <built-in function _get_cpu_capability> is currently not supported in Torchscript:
  File "/usr/local/lib/python3.10/dist-packages/torch/backends/cpu/__init__.py", line 17
    - "AVX512"
    """
    return torch._C._get_cpu_capability()
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
```
Ref: https://github.com/pytorch/vision/pull/7557

In this PR, `torch._C._get_cpu_capability()` is explicitly registered for JIT and tested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100723
Approved by: https://github.com/albanD
2023-05-09 09:52:29 +00:00
Valentin Andrei
9bc68fcd25 [pytorch] Accelerate indexing_backward_kernel with duplicates (#99441 attempt 2) (#100505)
By knowing the stride value ahead of time, we can simplify the kernel code as follows:

If `stride == 1` we can use the whole warp to reduce the gradients
If `stride < warp_size` we don't need the internal `while (start_feature < stride)` loop as `blockDim.x` is always 32

This change improves the performance of the kernel when duplicates are present and does not affect the performance with a low amount of duplicates. The implementation is deterministic.

The proposed implementation uses opmath_t to accumulate the gradient values in registers, so when using FP16/BF16 it may overflow if the number of elements is large. This is different from the initial implementation, which accumulates in scalar_t and does not overflow. In addition, when the stride is 1, we are using warp shuffles to sum the gradient, so the order of the addition is slightly different than a reference implementation, which causes some minor numerical differences when compared to a reference.

TEST CODE:

```
# The first element is the number of iterations.
# The second represents the number of unique elements. If
# set to 0, the number of unique elements is equal to the
# number of elements.
# The remaining elements are the tensor dimensions.

basic_indexing_tests = [
    [10, 0, 12345],
    [10, 4, 12345],
    [10, 16, 512, 512, 32],
    [10, 0, 4, 4],
    [10, 0, 32, 32],
    [10, 8, 32, 32],
    [10, 8, 64, 32, 16],
    [10, 0, 64, 32, 16],
    [10, 16, 512, 512, 32],
    [10, 0, 675, 999, 13],
    [10, 0, 123, 456, 31],
    [10, 0, 512, 512, 32],
    [10, 4, 512, 512, 32],
    [10, 2, 512, 512, 32],
    [10, 0, 128, 128, 16, 16],
    [10, 8, 128, 126, 16, 16],
    [10, 4, 128, 126, 16, 16],
    [10, 0, 64, 64, 16, 16, 16],
    [10, 8, 64, 64, 16, 16, 16],
    [10, 2, 64, 64, 16, 16, 16],
    [10, 1, 64, 64, 16, 16, 16],
]

def run_basic_indexing_on_device(x, index, expected, device_string, iters):
    x_dev = x.to(device_string)
    x_dev = x_dev.detach().requires_grad_()
    index_dev = index.to(device_string)

    # Run backward pass; keep gradients and measure time
    torch.cuda.synchronize()
    t_bw_s = time()
    for _ in range(iters):
        y = x_dev[index_dev]
        z = y.sum()
        z.backward()
    torch.cuda.synchronize()
    t_bw_s = (time() - t_bw_s) / iters

    return (x_dev.grad, t_bw_s)

def run_basic_indexing_test(test_input):
    tensor_size = tuple(test_input[:5])
    niters = test_input[0]
    num_unique = test_input[1]
    tensor_size = tuple(test_input[2:])

    numel = 1
    for dim in tensor_size:
        numel *= dim
    if num_unique == 0:
        num_unique = numel

    index = torch.randint(0, num_unique, tensor_size, dtype=torch.long, device="cpu")
    x = torch.randn((numel,), dtype=torch.float32, device="cuda")

    index = index.detach()
    x = x.detach().requires_grad_()

    (cpu_grad, t_bw_cpu) = run_basic_indexing_on_device(x, index, numel / 2, "cpu", 1)
    (gpu_grad, t_bw_gpu) = run_basic_indexing_on_device(x, index, numel / 2, "cuda", 1)

    max_delta = torch.max(torch.abs(cpu_grad - gpu_grad.to("cpu")))
    missmatches = torch.nonzero(torch.abs(cpu_grad - gpu_grad.to("cpu")))

    (gpu_grad_perf, t_gpu) = run_basic_indexing_on_device(
        x, index, numel / 2, "cuda", niters
    )

    print(
        "test = {}, delta = {:.5f}, missmatches = {} duration_ms = {:.3f}".format(
            tuple(test_input), max_delta, missmatches, t_gpu * 1000.0
        )
    )

    if torch.numel(missmatches) > 0:
        print("cpu grad = {}", cpu_grad[missmatches])
        print("gpu grad = {}", gpu_grad[missmatches])
```

RESULTS:

```
Default Implementation

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.726
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.867
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.514
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.689
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.547
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.537
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.199
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.584
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.055
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.411
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.419
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.048
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.633
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 606.403
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4.099
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 76.813
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.760
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.547
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 317.583
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1204.800
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2412.133

Small Stride Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.904
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.156
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 308.878
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.566
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.540
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.550
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.868
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.656
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.856
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.624
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.837
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.274
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1127.040
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2123.942
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.282
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 288.997
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 547.267
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 12.844
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1178.934
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4262.042
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8172.318

Stride 1 Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.692
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.834
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 81.023
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.631
test = (100, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.491
test = (100, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.477
test = (50, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.561
test = (50, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.516
test = (16, 10, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 126.455
test = (10, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.238
test = (10, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.520
test = (10, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 7.854
test = (10, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 306.327
test = (10, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 610.498
test = (5, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.684
test = (5, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 75.604
test = (5, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.679
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.525
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 315.095
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1214.715
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100505
Approved by: https://github.com/ngimel
2023-05-03 23:52:58 +00:00
vfdev-5
6a12f10b08 Publicly exposing torch.backends.cpu.get_cpu_capability() (#100164)
Description:

- As suggested by Nikita, created the `torch.backends.cpu` submodule and exposed `get_cpu_capability`.

- In the torchvision Resize method we want to know the current CPU capability in order to pick the appropriate code path.

The newly added vectorized resize of uint8 images on AVX2-capable CPUs is faster than the older path (uint8 -> float -> resize -> uint8). However, on non-AVX hardware (e.g. Mac M1) certain configurations are slower with the native uint8 path.
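
A minimal usage sketch of the new API (the capability strings shown are assumptions; "AVX2"/"AVX512"/"DEFAULT" are the kind of values an x86 build is expected to report):

```
import torch

# Query the CPU capability detected by this PyTorch build at runtime.
cap = torch.backends.cpu.get_cpu_capability()

# Branch on it the way a downstream library (e.g. torchvision Resize) might.
if cap in ("AVX2", "AVX512"):   # assumed return values on x86 with AVX support
    print(f"{cap}: take the vectorized uint8 resize path")
else:                           # e.g. "DEFAULT" or a non-x86 capability string
    print(f"{cap}: fall back to uint8 -> float -> resize -> uint8")
```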

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100164
Approved by: https://github.com/albanD, https://github.com/malfet
2023-05-03 19:02:07 +00:00
PyTorch MergeBot
1114673c90 Revert "[pytorch] Accelerate indexing_backward_kernel with duplicates (#99441)"
This reverts commit 97afbcbc80.

Reverted https://github.com/pytorch/pytorch/pull/99441 on behalf of https://github.com/ngimel due to breaks ROCM ([comment](https://github.com/pytorch/pytorch/pull/99441#issuecomment-1531804487))
2023-05-02 16:46:04 +00:00
Lu Fang
090ec55f8d
Only skip in torch inductor test
Differential Revision: D45464303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100435
2023-05-01 22:21:37 -07:00
Lu Fang
429155b3c8
Disable some checks to get the test to pass
Differential Revision: D45437730

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100364
2023-05-01 16:28:12 -07:00
valentinandrei
97afbcbc80 [pytorch] Accelerate indexing_backward_kernel with duplicates (#99441)
By knowing the stride value ahead of time, we can simplify the kernel code as follows:

- If `stride == 1` we can use the whole warp to reduce the gradients.
- If `stride < warp_size` we don't need the internal `while (start_feature < stride)` loop, as `blockDim.x` is always 32.

This change improves the performance of the kernel when duplicates are present and does not affect performance when there are few duplicates. The implementation is deterministic.

The proposed implementation uses `opmath_t` to accumulate the gradient values in registers, so when using FP16/BF16 it may overflow if the number of elements is large. This differs from the initial implementation, which accumulates in `scalar_t` and does not overflow. In addition, when the stride is 1 we use warp shuffles to sum the gradients, so the order of the additions differs slightly from a reference implementation, which causes some minor numerical differences.
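
The order-of-additions point can be illustrated outside the kernel. Below is a minimal Python sketch (not the kernel code) comparing a sequential accumulation order with a pairwise/tree order, which is similar in spirit to a warp-shuffle reduction:

```
import torch

torch.manual_seed(0)
vals = torch.randn(1 << 20, dtype=torch.float32)

# Order 1: sequential accumulation over fixed-size chunks (a "reference"-style order).
seq = torch.zeros((), dtype=torch.float32)
for chunk in vals.split(4096):
    seq = seq + chunk.sum()

# Order 2: pairwise/tree reduction, similar in spirit to a warp-shuffle reduction.
tree = vals.clone()
while tree.numel() > 1:
    if tree.numel() % 2:                       # pad odd lengths with a zero
        tree = torch.cat([tree, tree.new_zeros(1)])
    tree = tree[0::2] + tree[1::2]             # add neighbouring pairs

print(seq.item(), tree.item())
print(abs(seq.item() - tree.item()))           # typically tiny, but not exactly zero
```

Both orders sum the same values; only the grouping differs, which is where the minor differences against a reference come from.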

TEST CODE:

```
# The first element is the number of iterations.
# The second represents the number of unique elements. If
# set to 0, the number of unique elements is equal to the
# number of elements.
# The remaining elements are the tensor dimensions.

import torch
from time import time

basic_indexing_tests = [
    [10, 0, 12345],
    [10, 4, 12345],
    [10, 16, 512, 512, 32],
    [10, 0, 4, 4],
    [10, 0, 32, 32],
    [10, 8, 32, 32],
    [10, 8, 64, 32, 16],
    [10, 0, 64, 32, 16],
    [10, 16, 512, 512, 32],
    [10, 0, 675, 999, 13],
    [10, 0, 123, 456, 31],
    [10, 0, 512, 512, 32],
    [10, 4, 512, 512, 32],
    [10, 2, 512, 512, 32],
    [10, 0, 128, 128, 16, 16],
    [10, 8, 128, 126, 16, 16],
    [10, 4, 128, 126, 16, 16],
    [10, 0, 64, 64, 16, 16, 16],
    [10, 8, 64, 64, 16, 16, 16],
    [10, 2, 64, 64, 16, 16, 16],
    [10, 1, 64, 64, 16, 16, 16],
]

def run_basic_indexing_on_device(x, index, expected, device_string, iters):
    # `expected` is unused below; it is kept only to match the call sites.
    x_dev = x.to(device_string)
    x_dev = x_dev.detach().requires_grad_()
    index_dev = index.to(device_string)

    # Run backward pass; keep gradients and measure time
    torch.cuda.synchronize()
    t_bw_s = time()
    for _ in range(iters):
        y = x_dev[index_dev]
        z = y.sum()
        z.backward()
    torch.cuda.synchronize()
    t_bw_s = (time() - t_bw_s) / iters

    return (x_dev.grad, t_bw_s)

def run_basic_indexing_test(test_input):
    niters = test_input[0]
    num_unique = test_input[1]
    tensor_size = tuple(test_input[2:])

    numel = 1
    for dim in tensor_size:
        numel *= dim
    if num_unique == 0:
        num_unique = numel

    index = torch.randint(0, num_unique, tensor_size, dtype=torch.long, device="cpu")
    x = torch.randn((numel,), dtype=torch.float32, device="cuda")

    index = index.detach()
    x = x.detach().requires_grad_()

    (cpu_grad, t_bw_cpu) = run_basic_indexing_on_device(x, index, numel / 2, "cpu", 1)
    (gpu_grad, t_bw_gpu) = run_basic_indexing_on_device(x, index, numel / 2, "cuda", 1)

    max_delta = torch.max(torch.abs(cpu_grad - gpu_grad.to("cpu")))
    missmatches = torch.nonzero(torch.abs(cpu_grad - gpu_grad.to("cpu")))

    (gpu_grad_perf, t_gpu) = run_basic_indexing_on_device(
        x, index, numel / 2, "cuda", niters
    )

    print(
        "test = {}, delta = {:.5f}, missmatches = {} duration_ms = {:.3f}".format(
            tuple(test_input), max_delta, missmatches, t_gpu * 1000.0
        )
    )

    if torch.numel(missmatches) > 0:
        print("cpu grad = {}", cpu_grad[missmatches])
        print("gpu grad = {}", gpu_grad[missmatches])
```
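
The driver that produced the results below is not shown in the snippet above; presumably it is a loop over a test list like the one above (note that the iteration counts in the results differ from the sample list):

```
for test in basic_indexing_tests:
    run_basic_indexing_test(test)
```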

RESULTS:

```
Default Implementation

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.726
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.867
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.514
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.689
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.547
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.537
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.199
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.584
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.055
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.411
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.419
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.048
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.633
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 606.403
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4.099
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 76.813
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.760
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.547
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 317.583
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1204.800
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2412.133

Small Stride Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.904
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.156
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 308.878
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.566
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.540
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.550
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.868
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.656
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.856
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.624
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.837
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.274
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1127.040
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2123.942
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.282
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 288.997
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 547.267
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 12.844
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1178.934
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4262.042
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8172.318

Stride 1 Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.692
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.834
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 81.023
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.631
test = (100, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.491
test = (100, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.477
test = (50, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.561
test = (50, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.516
test = (16, 10, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 126.455
test = (10, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.238
test = (10, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.520
test = (10, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 7.854
test = (10, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 306.327
test = (10, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 610.498
test = (5, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.684
test = (5, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 75.604
test = (5, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.679
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.525
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 315.095
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1214.715
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99441
Approved by: https://github.com/ngimel
2023-05-01 22:41:00 +00:00