pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Edward Z. Yang	c567748e16	Make interpolate_bilinear deterministic using decomposition (#101115 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/101115 Approved by: https://github.com/ngimel	2023-05-11 22:48:01 +00:00
Yu, Guangye	14964b3aa5	Add is_xpu to torch type (#101072 )
# Motivation
Without this PR:
```python
>>> import torch
>>> torch.IntTensor.is_cuda
False
>>> torch.IntTensor.is_xpu
<attribute 'is_xpu' of 'torch._C._TensorBase' objects>
```
With this PR:
```python
>>> import torch
>>> torch.IntTensor.is_xpu
False
```
This aligns the behavior with CUDA. Some customer code uses is_xpu to check the backend; without this PR the check is always True, which results in unexpected behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101072 Approved by: https://github.com/mikaylagawarecki	2023-05-11 17:50:59 +00:00
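The "always True" pitfall described in this message can be reproduced without PyTorch. The sketch below is a pure-Python analogy, not the PyTorch implementation: `TensorBase`, `IntTensorLike` and `FixedIntTensorLike` are hypothetical stand-ins, and a plain `property` plays the role of the C-level attribute descriptor. Reading a property through the class returns the descriptor object itself, and such an object is truthy.

```python
class TensorBase:
    """Hypothetical stand-in for a base class exposing a per-instance flag."""

    @property
    def is_xpu(self):
        return False


class IntTensorLike(TensorBase):
    """Hypothetical tensor *type* with no class-level override of is_xpu."""


class FixedIntTensorLike(TensorBase):
    """Same type with an explicit class-level value, analogous to the fix."""

    is_xpu = False


# Accessed on the class, the property object itself comes back, not a bool.
before = IntTensorLike.is_xpu
after = FixedIntTensorLike.is_xpu

print(type(before).__name__, bool(before))  # property True
print(type(after).__name__, bool(after))    # bool False
```

A check such as `if SomeType.is_xpu:` therefore takes the "XPU" branch for every type until the class-level value is a real boolean.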
vfdev-5	622e582a2b	Register get_cpu_capability for jit (#100723 )
Description:
Context: In torchvision we ensure that functional ops are torchscriptable. Recently exposed `torch.backends.cpu.get_cpu_capability()` in https://github.com/pytorch/pytorch/pull/100164 is failing in torchvision CI
```
RuntimeError:
Python builtin <built-in function _get_cpu_capability> is currently not supported in Torchscript:
  File "/usr/local/lib/python3.10/dist-packages/torch/backends/cpu/__init__.py", line 17
    - "AVX512"
    """
    return torch._C._get_cpu_capability()
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
```
Ref: https://github.com/pytorch/vision/pull/7557
In this PR, `torch._C._get_cpu_capability()` is explicitly registered for JIT and tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100723 Approved by: https://github.com/albanD	2023-05-09 09:52:29 +00:00
Valentin Andrei	9bc68fcd25	[pytorch] Accelerate indexing_backward_kernel with duplicates (#99441 attempt 2) (#100505 )
By knowing the stride value ahead of time, we can simplify the kernel code as follows:
- If `stride == 1`, we can use the whole warp to reduce the gradients.
- If `stride < warp_size`, we don't need the internal `while (start_feature < stride)` loop, as `blockDim.x` is always 32.

These changes improve the performance of the kernel when duplicates are present and do not affect the performance with a low number of duplicates. The implementation is deterministic.

The proposed implementation uses `opmath_t` to accumulate the gradient values in registers, so when using FP16/BF16 it may overflow if the number of elements is large. This is different from the initial implementation, which accumulates in `scalar_t` and does not overflow.

In addition, when the stride is 1, we use warp shuffles to sum the gradients, so the order of the additions is slightly different from that of a reference implementation, which causes some minor numerical differences when compared to a reference.

TEST CODE:
```
from time import time

import torch

# The first element is the number of iterations.
# The second represents the number of unique elements. If
# set to 0, the number of unique elements is equal to the
# number of elements.
# The remaining elements are the tensor dimensions.
basic_indexing_tests = [
    [10, 0, 12345],
    [10, 4, 12345],
    [10, 16, 512, 512, 32],
    [10, 0, 4, 4],
    [10, 0, 32, 32],
    [10, 8, 32, 32],
    [10, 8, 64, 32, 16],
    [10, 0, 64, 32, 16],
    [10, 16, 512, 512, 32],
    [10, 0, 675, 999, 13],
    [10, 0, 123, 456, 31],
    [10, 0, 512, 512, 32],
    [10, 4, 512, 512, 32],
    [10, 2, 512, 512, 32],
    [10, 0, 128, 128, 16, 16],
    [10, 8, 128, 126, 16, 16],
    [10, 4, 128, 126, 16, 16],
    [10, 0, 64, 64, 16, 16, 16],
    [10, 8, 64, 64, 16, 16, 16],
    [10, 2, 64, 64, 16, 16, 16],
    [10, 1, 64, 64, 16, 16, 16],
]


def run_basic_indexing_on_device(x, index, expected, device_string, iters):
    x_dev = x.to(device_string)
    x_dev = x_dev.detach().requires_grad_()
    index_dev = index.to(device_string)

    # Run backward pass; keep gradients and measure time
    torch.cuda.synchronize()
    t_bw_s = time()
    for _ in range(iters):
        y = x_dev[index_dev]
        z = y.sum()
        z.backward()
    torch.cuda.synchronize()
    t_bw_s = (time() - t_bw_s) / iters

    return (x_dev.grad, t_bw_s)


def run_basic_indexing_test(test_input):
    niters = test_input[0]
    num_unique = test_input[1]
    tensor_size = tuple(test_input[2:])

    # Total number of elements is the product of the dimensions
    numel = 1
    for dim in tensor_size:
        numel *= dim
    if num_unique == 0:
        num_unique = numel

    index = torch.randint(0, num_unique, tensor_size, dtype=torch.long, device="cpu")
    x = torch.randn((numel,), dtype=torch.float32, device="cuda")
    index = index.detach()
    x = x.detach().requires_grad_()

    (cpu_grad, t_bw_cpu) = run_basic_indexing_on_device(x, index, numel / 2, "cpu", 1)
    (gpu_grad, t_bw_gpu) = run_basic_indexing_on_device(x, index, numel / 2, "cuda", 1)

    max_delta = torch.max(torch.abs(cpu_grad - gpu_grad.to("cpu")))
    missmatches = torch.nonzero(torch.abs(cpu_grad - gpu_grad.to("cpu")))

    (gpu_grad_perf, t_gpu) = run_basic_indexing_on_device(
        x, index, numel / 2, "cuda", niters
    )

    # Convert seconds to milliseconds for reporting
    print(
        "test = {}, delta = {:.5f}, missmatches = {} duration_ms = {:.3f}".format(
            tuple(test_input), max_delta, missmatches, t_gpu * 1000.0
        )
    )

    if torch.numel(missmatches) > 0:
        print("cpu grad = {}", cpu_grad[missmatches])
        print("gpu grad = {}", gpu_grad[missmatches])
```

RESULTS:
```
Default Implementation
test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.726
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.867
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.514
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.689
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.547
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.537
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.199
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.584
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.055
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.411
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.419
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.048
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.633
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 606.403
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4.099
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 76.813
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.760
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.547
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 317.583
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1204.800
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2412.133

Small Stride Kernel Version
test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.904
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.156
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 308.878
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.566
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.540
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.550
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.868
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.656
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.856
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.624
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.837
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.274
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1127.040
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2123.942
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.282
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 288.997
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 547.267
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 12.844
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1178.934
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4262.042
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8172.318

Stride 1 Kernel Version
test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.692
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.834
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 81.023
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.631
test = (100, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.491
test = (100, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.477
test = (50, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.561
test = (50, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.516
test = (16, 10, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 126.455
test = (10, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.238
test = (10, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.520
test = (10, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 7.854
test = (10, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 306.327
test = (10, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 610.498
test = (5, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.684
test = (5, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 75.604
test = (5, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.679
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.525
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 315.095
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1214.715
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100505 Approved by: https://github.com/ngimel	2023-05-03 23:52:58 +00:00
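For orientation, the gradient that the test code compares between CPU and GPU can be written in a few lines of plain Python. This is a reference sketch of the math only: the function name `index_backward_reference` is hypothetical, and nothing here reflects how the CUDA kernel is organized. Every position in `index` adds its incoming gradient to the source element it selected, so duplicated indices accumulate.

```python
def index_backward_reference(numel, index, grad_out):
    """Gradient of y = x[index] w.r.t. x for a 1-D x with `numel` elements."""
    grad_x = [0.0] * numel
    # Visiting contributions in sorted index order fixes the summation
    # order, so repeated runs produce bit-identical results.
    for i in sorted(range(len(index)), key=lambda k: index[k]):
        grad_x[index[i]] += grad_out[i]
    return grad_x


# z = y.sum() sends a gradient of 1.0 to every selected element, so each
# entry of grad_x is simply how often that index was selected.
index = [0, 0, 2, 0, 3]
grad = index_backward_reference(4, index, [1.0] * len(index))
print(grad)  # [3.0, 0.0, 1.0, 1.0]
```

The fewer unique indices there are, the more contributions land on the same element, which is the duplicate-heavy case the second column of `basic_indexing_tests` controls.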
vfdev-5	6a12f10b08	Publicly exposing `torch.backends.cpu.get_cpu_capability()` (#100164 ) Description: - As suggested by Nikita, created `torch.backends.cpu` submodule and exposed `get_cpu_capability`. - In torchvision's Resize method we want to know the current CPU capability in order to pick the appropriate code path depending on CPU capabilities. The newly coded vectorized resize of uint8 images on AVX2-supported CPUs is now faster than the older way (uint8->float->resize->uint8). However, on non-AVX hardware (e.g. Mac M1) certain configs are slower using native uint8. Pull Request resolved: https://github.com/pytorch/pytorch/pull/100164 Approved by: https://github.com/albanD, https://github.com/malfet	2023-05-03 19:02:07 +00:00
PyTorch MergeBot	1114673c90	Revert "[pytorch] Accelerate indexing_backward_kernel with duplicates (#99441 )" This reverts commit `97afbcbc80`. Reverted https://github.com/pytorch/pytorch/pull/99441 on behalf of https://github.com/ngimel due to breaks ROCM ([comment](https://github.com/pytorch/pytorch/pull/99441#issuecomment-1531804487))	2023-05-02 16:46:04 +00:00
Lu Fang	090ec55f8d	Only skip in torch inductor test Differential Revision: D45464303 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100435	2023-05-01 22:21:37 -07:00
Lu Fang	429155b3c8	Disable some check to get the test pass Differential Revision: D45437730 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100364	2023-05-01 16:28:12 -07:00
valentinandrei	97afbcbc80	[pytorch] Accelerate indexing_backward_kernel with duplicates (#99441 ) By knowing the stride value ahead of time, we can simplify the kernel code as follows: If `stride == 1` we can use the whole warp to reduce the gradients If `stride < warp_size` we don't need the internal `while (start_feature < stride)` loop as `blockDim.x` is always 32 This changes improve the performance of the kernel when duplicates are present and do not affect the performance with low amount of duplicates. The implementation is deterministic. The proposed implementation uses `opmath_t` to accumulate in registers the gradient values so when using FP16/BF16 it may overflow if the number of elements is large. This is different from the initial implementation who accumulates in `scalar_t` and does not overflow. In addition, when the stride is 1, we are using warp shuffles to sum the gradient so the order of the addition is slightly different than a reference implementation which causes some minor numerical differences when compared to a reference. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99441 Approved by: https://github.com/ngimel	2023-05-01 22:41:00 +00:00
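The remark about summation order can be illustrated without a GPU. Floating-point addition is not associative, so regrouping the same values, as a parallel reduction does compared with a left-to-right loop, may change the last bits of the result. The snippet uses Python's double-precision floats purely as an analogy; it does not model the warp-shuffle reduction itself.

```python
a, b, c = 0.1, 0.2, 0.3

# Same three values, two different groupings of the additions.
left_to_right = (a + b) + c
regrouped = a + (b + c)

print(left_to_right)               # 0.6000000000000001
print(regrouped)                   # 0.6
print(left_to_right == regrouped)  # False

# The discrepancy is tiny, which is why such results are usually compared
# with a tolerance rather than with exact equality.
difference = abs(left_to_right - regrouped)
print(difference < 1e-15)          # True
```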
Lu Fang	d7fa7fa8cf	Introduce fast path in the CPU equal op Differential Revision: D45282119 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100024	2023-04-28 16:00:17 -07:00
kshitij12345	61dffa61c3	[fix] masked_scatter_: non-contiguous self (#100232 ) Fixes https://github.com/pytorch/pytorch/issues/99638 Pull Request resolved: https://github.com/pytorch/pytorch/pull/100232 Approved by: https://github.com/ngimel	2023-04-28 18:12:23 +00:00
dujinhang	9cd48b0575	Add warning information for dtypetensor. (#99521 ) Fixes #ISSUE_NUMBER Without affecting the existing cpu/cuda logic, a separate interface is provided for the custom backend, and users can choose whether to use the interface function, which provides 10 tensor types with custom backend variations. Therefore, users can use torch.set_default_tensor_type to set the default device tensor type, or use torch.xxx.dtypetensor to create a tensor. For example, torch.set_default_tensor_type(torch.foo.DoubleTensor) or torch.foo.DoubleTensor([]). @albanD, please review my changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99521 Approved by: https://github.com/albanD	2023-04-28 18:01:45 +00:00
Larry Liu	687afeb686	[dynamo][numpy] Add NumpyTensorVariable to translate ndarray attribute calls to tensor attributes (#95849 ) Issue: #93684 # Problem Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations. # Design (as I know it) * Use torch_np.ndarray(a wrapper of tensor) to back a `VariableTracker`: `NumpyTensorVariable`. * Translate all attributes and methods calls, on ndarray, to torch_np.ndarray equivalent. This PR adds `NumpyTensorVariable` and supports: 1. tensor to ndarray, ndarray to tensor 2. numpy functions such as numpy.meshgrid() 3. ndarray attributes such as `itemsize`, `stride` Next PR will handle returning `np.ndarray` and add support for ndarray methods Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849 Approved by: https://github.com/ezyang	2023-04-27 16:18:35 +00:00
Jiong Gong	e5c9a0fcf5	[dynamo] avoid graph break on repeat_interleave.self_int (#99528 ) Address convit_base failure: https://github.com/pytorch/torchdynamo/issues/1886 mentioned in https://github.com/pytorch/pytorch/issues/93777 Also for models like EleutherAI/gpt-j-6B. Pull Request resolved: https://github.com/pytorch/pytorch/pull/99528 Approved by: https://github.com/ezyang	2023-04-25 04:47:39 +00:00
BJ Hargrave	555ab310dc	Add itemsize and nbytes properties to Tensor (#98322 ) Adds properties for itemsize and nbytes to Tensor matching the properties in NumPy. Fixes https://github.com/pytorch/pytorch/issues/12728 Pull Request resolved: https://github.com/pytorch/pytorch/pull/98322 Approved by: https://github.com/ezyang	2023-04-05 12:11:55 +00:00
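The relationship behind these two properties is the NumPy one: `nbytes` equals the element count times `itemsize`. As an illustration that needs only the standard library, Python's built-in `memoryview` exposes attributes with the same names; the snippet below uses it as a stand-in and does not call into PyTorch.

```python
from array import array

# A typed buffer of three C floats; memoryview reports its layout.
buf = memoryview(array("f", [1.0, 2.0, 3.0]))

element_count = len(buf)
print(buf.itemsize, buf.nbytes)

# nbytes is the element count multiplied by the per-element size.
assert buf.nbytes == element_count * buf.itemsize
```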
Jason Ansel	b96fe9b61c	Fix issues related to ClassInstantier in HF models (#97997 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/97997 Approved by: https://github.com/anijain2305	2023-04-04 00:01:08 +00:00
Jason Ansel	71d850a100	[inductor] Fallback on complex64 kernels (#98155 ) Later PRs in this stack fix graph breaks in GoogleFnet, which triggers errors from inductor trying to compile torch.complex64; this fixes that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98155 Approved by: https://github.com/anijain2305, https://github.com/ngimel	2023-04-03 01:06:43 +00:00
Nikita Shulga	2af09393f9	`masked_scatter` should accept only bool masks (#97999 ) Modify test_torch to check that an assert is raised in this case. torch.uint8 usage has been deprecated for a few releases, and errors have been raised for other dtypes on CUDA devices, but not on CPU. This PR finally restricts mask to just `torch.bool`. See https://github.com/pytorch/pytorch/pull/96594 as an example doing it for `torch.masked_fill`. Fixes https://github.com/pytorch/pytorch/issues/94634 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97999 Approved by: https://github.com/ngimel	2023-04-01 23:25:25 +00:00
Nikita Shulga	a1dc2b1774	[BE] Remove bool dtype from `masked_scatter` (#98015 ) 🤖 Generated by Copilot at a9fa438: Simplified a test function for `torch.masked_scatter` in `test/test_torch.py` by removing redundant and unnecessary code. Pull Request resolved: https://github.com/pytorch/pytorch/pull/98015 Approved by: https://github.com/ezyang	2023-03-31 01:45:57 +00:00
Aleksei Nikiforov	8289120ef0	Revert "test/test_torch.py: fix TestTorch::test_from_buffer test (#96952 )" (#97759 ) Tests were already fixed in https://github.com/pytorch/pytorch/pull/92834, and these changes instead of also fixing tests are now breaking them again. This reverts commit `7f94ea8492`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97759 Approved by: https://github.com/janeyx99	2023-03-28 18:43:08 +00:00
Nikita Shulga	542fb0b1fa	Specify file encoding in test_torch.py (#97628 ) Attempt to fix ``` UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 5260: ordinal not in range(128) ``` in https://github.com/pytorch/pytorch/actions/runs/4522628359/jobs/7965372405 In general, it's good practice to specify the encoding explicitly, as otherwise it depends on an environment variable and makes test failures unpredictable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/97628 Approved by: https://github.com/dagitses, https://github.com/kit1980	2023-03-26 20:03:25 +00:00
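The UnicodeDecodeError quoted in this message is what happens when bytes outside the ASCII range are decoded with an ASCII default. Below is a minimal sketch of the recommended practice using only the standard library; the file name and sample text are made up for the example and are unrelated to test_torch.py.

```python
import os
import tempfile

text = "gr\u00f6\u00dfe"  # sample string with two non-ASCII characters

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # hypothetical file name

    # Passing encoding= makes the result independent of locale settings.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

    # Decoding the same bytes as ASCII fails, as in the quoted CI error.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

print(round_trip == text, ascii_failed)  # True True
```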
Edward Z. Yang	37faa48844	DCE inference graphs too (#97275 ) I added a bunch of asserts to verify that I didn't accidentally kill copy_ in the graph, hopefully this combined with our existing tests is good enough. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/97275 Approved by: https://github.com/bdhirsh	2023-03-23 07:02:52 +00:00
Kurt Mohler	fbc803df0c	Only warn once for TypedStorage deprecation (#97379 ) Fixes #97207 Pull Request resolved: https://github.com/pytorch/pytorch/pull/97379 Approved by: https://github.com/ezyang	2023-03-23 05:40:23 +00:00
Aleksei Nikiforov	7f94ea8492	test/test_torch.py: fix TestTorch::test_from_buffer test (#96952 ) Use opposite encoding on big endian systems Pull Request resolved: https://github.com/pytorch/pytorch/pull/96952 Approved by: https://github.com/ezyang	2023-03-17 14:36:33 +00:00
mingfeima	06054d7df0	fix random output issue on index_select when src is scalar and index is empty (#96408 ) Fix https://github.com/pytorch/pytorch/issues/94340 Pull Request resolved: https://github.com/pytorch/pytorch/pull/96408 Approved by: https://github.com/ngimel	2023-03-16 05:30:45 +00:00
Kurt Mohler	06b7285163	Add `torch._check` functions analogous to C++ `TORCH_CHECK` (#88725 ) Adds `_check`, `_check_index`, `_check_value`, `_check_type`, `_check_not_implemented`, `_check_tensor_all` Part of #72948 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88725 Approved by: https://github.com/albanD	2023-03-14 20:44:50 +00:00
kvathupo	2b9d9bcb85	Deprecate non-bool masks in masked_fill (#96594 ) __What?__ Per discussion at #94634, deprecate `masked_fill` with non-bool masks. Deprecation warnings were previously added by #22261, but not for Apple MPS. I can revert the MPS changes if deprecation warnings are wanted first tho. See also #96112. Fixes #85063 and #89320. __Further Development?__ - Fixed the mask dtype checking for the cuda dispatch for `masked_fill` in `aten/src/ATen/native/cuda/Indexing.cu` Pull Request resolved: https://github.com/pytorch/pytorch/pull/96594 Approved by: https://github.com/malfet, https://github.com/ngimel	2023-03-13 01:41:47 +00:00
Nikita Shulga	1cd0929bf7	[BC] Allow only `bool` tensors as mask in `masked_select` (#96112 ) `byte` support was marked as deprecated in 1.8, so it's fine to remove this in 2.1 (or even 2.0) Deprecation warning was added by https://github.com/pytorch/pytorch/pull/22261 Also, fix bunch of syntactic errors in comments Pull Request resolved: https://github.com/pytorch/pytorch/pull/96112 Approved by: https://github.com/ezyang	2023-03-07 01:43:14 +00:00
puririshi98	8aa34602f7	Jetson Update for CI Redo (#94549 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549 Approved by: https://github.com/ezyang, https://github.com/malfet	2023-02-21 17:13:38 +00:00
Yuxin Wu	9bb2fe3eae	fix numpy1.24 deprecations in unittests (#93997 ) Fixes https://github.com/pytorch/pytorch/issues/91329 Pull Request resolved: https://github.com/pytorch/pytorch/pull/93997 Approved by: https://github.com/ngimel, https://github.com/jerryzh168	2023-02-18 00:59:09 +00:00
Xuehai Pan	b005ec62b9	[BE] Remove dependency on `six` and `future` (#94709 ) Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709 Approved by: https://github.com/malfet, https://github.com/Skylion007	2023-02-14 09:14:14 +00:00
Brian Hirsh	ceb0f1576b	turn functionalization on in aot_autograd inference (#92857 ) still waiting for CI fallout fixes #90759 Pull Request resolved: https://github.com/pytorch/pytorch/pull/92857 Approved by: https://github.com/ezyang	2023-02-13 17:48:00 +00:00
Nikita Shulga	4869929f32	Update Triton hash (#94249 ) That includes MLIR + latest packaging changes (that also download ptxas from CUDA-12) Tweak CI to install gcc-9 to build Triton Disable a few tests to make everything be correct Pull Request resolved: https://github.com/pytorch/pytorch/pull/94249 Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/weiwangmeta	2023-02-13 13:17:36 +00:00
Aaron Gokaslan	9171f7d4cd	[BE] Modernize PyTorch even more for 3.8 with pyupgrade (#94520 ) Applies some more pyupgrade fixits to PyTorch Pull Request resolved: https://github.com/pytorch/pytorch/pull/94520 Approved by: https://github.com/ezyang	2023-02-10 18:02:50 +00:00
ganler	db6cfff827	fix: forbid multi-index for index_select over scalar (#94347 ) Fixes #88940 According to the [doc](https://pytorch.org/docs/stable/generated/torch.index_select.html): 1. "The returned tensor has the same number of dimensions as the original tensor (`input`). " 2. "The `dim`th dimension has the same size as the length of `index`; other dimensions have the same size as in the original tensor." These two conditions cannot be satisfied at the same time if the `input` is a scalar && `index` has multiple values: because a scalar at most holds one element (according to property 1, the output is a scalar), it is impossible to satisfy "The `dim`th dimension has the same size as the length of `index`" when `index` has multiple values. However, currently, if we do so we either get: 1. Buffer overflow with ASAN; 2. Or (w/o ASAN) silently returns outputs that is not consistent with the doc (`x.index_select(0, torch.Tensor([0, 0, 0]).int())` returns `x`). As a result, we should explicitly reject such cases. Pull Request resolved: https://github.com/pytorch/pytorch/pull/94347 Approved by: https://github.com/malfet	2023-02-10 17:17:09 +00:00
min-jean-cho	900e09c872	[Dynamo] Support torch.Tensor.fn as TorchVariable, not UserDefinedObjectVariable, preventing graph break (#93243 ) As found in #92709, thanks to @ngimel and @jansel, currently `torch.Tensor.fn` points to `UserDefinedObjectVariable` rather than `TorchVariable`. The root cause is due to https://github.com/pytorch/pytorch/pull/92709#pullrequestreview-1273357406. To prevent this, build `TorchVariable` of `torch.Tensor.fn` pointing to `torch.ops.aten.fn`. This issue propagates to `torch.Tensor.fn` causing graph break with `nopython=True`. ```python import torch import torch._dynamo as dynamo #op = torch.ops.aten.abs_ # no graph break op = torch.Tensor.abs_ # graph break args = torch.empty(10) def foo(args): return op(args) opt_foo = dynamo.optimize("inductor", nopython=True)(foo) y_ = opt_foo(args) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/93243 Approved by: https://github.com/jansel	2023-02-07 09:26:50 +00:00
min-jean-cho	6e1cfcdf4b	cauchy_ few fixes (1) check gamma > 0 (2) better dtype error log (#93314 ) Related #92047 (1) `torch.Tensor.cauchy_` is missing check for `gamma > 0` (`torch.distributions.cauchy.Cauchy` correctly checks `gamma > 0`). (2) add better error log on dtype similar to exponential_ Pull Request resolved: https://github.com/pytorch/pytorch/pull/93314 Approved by: https://github.com/jgong5, https://github.com/fritzo, https://github.com/lezcano	2023-02-03 11:56:28 +00:00
min-jean-cho	2f0b0c5dd7	exponential_ few fixes (1) lambda > 0 (2) mkl kernel to continuous (3) better error log on dtype (#92891 ) Exponential distribution is continuous. Fixes CPU MKL exponential implementation to exclude integer dtypes. ```python import torch dtypes = [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64] for dtype in dtypes: x = torch.empty(10000, dtype=dtype).exponential_() # should fail ! print("dtype: ", x.dtype, "sum: ", x.sum()) ``` ### Additional Context Related to #92709. This issue propagates to OpInfo of exponential. ``` AssertionError: The supported dtypes for exponential on device type cpu are incorrect! The following dtypes worked in forward but are not listed by the OpInfo: {torch.int64, torch.uint8, torch.int8, torch.int16, torch.int32}. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/92891 Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/ngimel	2023-01-28 02:27:16 +00:00
Yanbo Liang	a6b51448f5	[Dynamo] Supports if condition on user defined object (#90892 ) Fixes Meta internal user case, see the pattern in unit test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90892 Approved by: https://github.com/jansel, https://github.com/mlazos	2023-01-26 04:19:32 +00:00
PyTorch MergeBot	9b23fd378f	Revert "Logcumsumexp for complex in CPU and CUDA (#90847 )" This reverts commit `64985123e4`. Reverted https://github.com/pytorch/pytorch/pull/90847 on behalf of https://github.com/malfet due to Reverting to decrease build time, let's discuss the alternatives here	2023-01-24 20:49:08 +00:00
pierreHaslee	1c30844eaa	where() function added as a Tensor method as well (#92849 ) Fixes #88470 I added the "method" keyword in `aten/src/ATen/native/native_functions.yaml` for the function `where` with Scalar Overload. This way, you can now use `Tensor.where()` with a scalar parameter the same way `torch.where()` can. I added a test in `test/test_torch.py` as requested. It uses the `where()` method on a tensor and then checks it has the same results as the `torch.where()` function. The test is roughly the same as the one provided by the author of the issue. PS: this is the second PR I make to resolve this issue, the first one is #92747. I had troubles with commit signatures and is therefore closed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/92849 Approved by: https://github.com/albanD	2023-01-24 03:09:33 +00:00
mfkasim1	64985123e4	Logcumsumexp for complex in CPU and CUDA (#90847 ) Another PR towards solving #89205. What's in this PR: * The implementation of forward `logcumsumexp` for complex numbers in CPU & CUDA * The tests on forward call of `logcumsumexp` for complex numbers * The implementation of backward `logcumsumexp` for complex numbers What's missing: * The test on backward gradient of `logcumsumexp` (it complains `RuntimeError: logcumsumexp does not support automatic differentiation for outputs with complex dtype.` and I don't know how to solve the error and I don't know where to put the test for the backward computation). If possible, I'd like this to be done in this PR. It's really tricky to handle the edge cases here (i.e. the ones involving `inf`), but I've tried my best to put some comments explaining the reasoning behind my decisions in this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/90847 Approved by: https://github.com/albanD	2023-01-20 15:10:50 +00:00
Kurt Mohler	647b8f8e3e	Add TORCH_CHECK_TENSOR_ALL (#89097 ) `TORCH_CHECK_TENSOR_ALL(cond, ...)` is a wrapper around `TORCH_CHECK` which allows the condition argument to be a tensor, batched or unbatched. `cond` can be a boolean tensor of any size. If any element is False, or if `cond.numel() == 0`, then `TORCH_CHECK_TENSOR_ALL` raises an error Part of #72948 Pull Request resolved: https://github.com/pytorch/pytorch/pull/89097 Approved by: https://github.com/zou3519	2023-01-19 21:04:09 +00:00
BowenBao	a72bcb3388	Do not leak SkipFrame exception to parent frames (#91059 ) Discovered by https://github.com/pytorch/torchdynamo/issues/2000, we noticed the exception `SkipFrame` to avoid repeatedly compiling frame of loop with graph breaks could leak to parent frames while inlining, which then prevents compiling. This PR checks at inlining if such exception is raised and would instead raise an `Unsupported` to the outer frame. The original behavior and goal of #88857 is unaffected: the inner frame that has loop would still be skipped. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91059 Approved by: https://github.com/jansel, https://github.com/thiagocrepaldi	2023-01-13 17:11:22 +00:00
XiaobingSuper	1892c75a45	fix norrow_copy correctness issue for non-contiguous input for cpu path(reland) (#91883 ) This PR is about re-land https://github.com/pytorch/pytorch/pull/91789. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91883 Approved by: https://github.com/lezcano	2023-01-10 10:56:18 +00:00
PyTorch MergeBot	d85f3c8237	Revert "fix norrow_copy correctness issue for non-contiguous input for cpu path (#91789 )" This reverts commit `136dadd689`. Reverted https://github.com/pytorch/pytorch/pull/91789 on behalf of https://github.com/huydhn due to This breaks trunk with XPASS test_vmap_exhaustive_narrow_copy_cpu_float32 `136dadd689`	2023-01-09 06:50:20 +00:00
XiaobingSuper	136dadd689	fix norrow_copy correctness issue for non-contiguous input for cpu path (#91789 ) Fix https://github.com/pytorch/pytorch/issues/91690. Pull Request resolved: https://github.com/pytorch/pytorch/pull/91789 Approved by: https://github.com/jgong5, https://github.com/lezcano	2023-01-09 00:55:03 +00:00
PyTorch MergeBot	b3603f8129	Revert "Deduplicate c10 error and PyTorchError hierarchy (#87855 )" This reverts commit `34f2d3e6ae`. Reverted https://github.com/pytorch/pytorch/pull/87855 on behalf of https://github.com/osalpekar due to perf regression in quantization tests	2023-01-06 19:56:35 +00:00
William Phetsinorath	34f2d3e6ae	Deduplicate c10 error and PyTorchError hierarchy (#87855 ) Fixes #53370 Pull Request resolved: https://github.com/pytorch/pytorch/pull/87855 Approved by: https://github.com/albanD	2023-01-02 15:53:36 +00:00
ecao	274d3b24c3	use scatter_add for index_add when dim is the most inner dim (#88729 ) ### Motivation When dim is -1 and the slice of source or result is noncontiguous, original `index_add` is slow as it uses add for the sliced tensor, which is serial on index and parallel on sliced tensor to avoid write conflict. Doing parallel on the sliced tensor is not optimal as the size of sliced tensor may be not big enough to parallel and also causes multiple parallelizations. `scatter_add ` is used to speedup for this case as `scatter_add ` parallels on the outer dimension of input and is serial on the inner dimension to avoid write conflict. `scatter_add ` only need one parallel and the size of outer dimensions is bigger to do parallel. ### Testing - Single core: Before: shape \| fp32 / s \| bf16 / s -- \| -- \| -- [10, 128, 20, 20] \| 2.82E-03 \| 2.11E-03 [10, 128, 50, 50] \| 0.023604 \| 0.023794 After: shape \| fp32 / s \| bf16 / s -- \| -- \| -- [10, 128, 20, 20] \| 9.30E-04 \| 1.66E-03 [10, 128, 50, 50] \| 0.005995 \| 0.010003 - Single socket (28 cores): Before: shape \| fp32 / s \| bf16 / s -- \| -- \| -- [10, 128, 20, 20] \| 2.96E-03 \| 2.52E-03 [10, 128, 50, 50] \| 0.012208 \| 0.012568 After: shape \| fp32 / s \| bf16 / s -- \| -- \| -- [10, 128, 20, 20] \| 7.44E-05 \| 1.33E-04 [10, 128, 50, 50] \| 0.000333 \| 0.000469 Pull Request resolved: https://github.com/pytorch/pytorch/pull/88729 Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet	2022-12-28 12:04:17 +00:00
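
The `index_add` change in 274d3b24c3 rests on the observation that, along the innermost dimension, `index_add` can be expressed as `scatter_add` with the index broadcast over the leading dimensions. The following is a minimal Python-level sketch of that equivalence only; it is not the CPU kernel from the PR, and the helper name `index_add_via_scatter_add` and the tensor shapes are illustrative assumptions.

```python
import torch


def index_add_via_scatter_add(base, index, source):
    """Hypothetical helper: index_add along dim=-1 written with scatter_add.

    `index` is a 1-D LongTensor whose length equals source.shape[-1].
    """
    # Broadcast the 1-D index over the leading dimensions of `source`, so each
    # innermost slice of `source` is scattered with the same index vector.
    expanded = index.expand_as(source)
    return base.scatter_add(-1, expanded, source)


torch.manual_seed(0)
base = torch.zeros(2, 3, 5)
source = torch.randn(2, 3, 4)
index = torch.tensor([0, 4, 4, 2])  # duplicate entries accumulate

expected = base.index_add(-1, index, source)
actual = index_add_via_scatter_add(base, index, source)
print(torch.allclose(expected, actual))
```

Duplicate index values (the two `4`s above) accumulate in both formulations, which is why the two calls agree.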

1945 Commits