Commit Graph

1945 Commits

Author SHA1 Message Date
Edward Z. Yang
c567748e16 Make interpolate_bilinear deterministic using decomposition (#101115)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101115
Approved by: https://github.com/ngimel
2023-05-11 22:48:01 +00:00
Yu, Guangye
14964b3aa5 Add is_xpu to torch type (#101072)
# Motivate
Without this PR:
```python
>>>import torch
>>>torch.IntTensor.is_cuda
False
>>>torch.IntTensor.is_xpu
<attribute 'is_xpu' of 'torch._C._TensorBase' objects>
```

With this PR:
```python
>>>import torch
>>>torch.IntTensor.is_xpu
False
```
Align to CUDA, some customer code use is_xpu to check the backend. Without this PR, the check is always True which result in an unexpected behavior

Pull Request resolved: https://github.com/pytorch/pytorch/pull/101072
Approved by: https://github.com/mikaylagawarecki
2023-05-11 17:50:59 +00:00
vfdev-5
622e582a2b Register get_cpu_capability for jit (#100723)
Description:

Context: In torchvision we ensure that functional ops are torchscriptable. Recently exposed `torch.backends.cpu.get_cpu_capability()` in https://github.com/pytorch/pytorch/pull/100164 is failing in torchvision CI
```
RuntimeError:
Python builtin <built-in function _get_cpu_capability> is currently not supported in Torchscript:
  File "/usr/local/lib/python3.10/dist-packages/torch/backends/cpu/__init__.py", line 17
    - "AVX512"
    """
    return torch._C._get_cpu_capability()
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
```
Ref: https://github.com/pytorch/vision/pull/7557

In this PR, `torch._C._get_cpu_capability()` is explicitly registered for JIT and tested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100723
Approved by: https://github.com/albanD
2023-05-09 09:52:29 +00:00
Valentin Andrei
9bc68fcd25 [pytorch] Accelerate indexing_backward_kernel with duplicates (#99441 attempt 2) (#100505)
By knowing the stride value ahead of time, we can simplify the kernel code as follows:

If stride == 1 we can use the whole warp to reduce the gradients
If stride < warp_size we don't need the internal while (start_feature < stride) loop as blockDim.x is always 32

This changes improve the performance of the kernel when duplicates are present and do not affect the performance with low amount of duplicates. The implementation is deterministic.

The proposed implementation uses opmath_t to accumulate in registers the gradient values so when using FP16/BF16 it may overflow if the number of elements is large. This is different from the initial implementation who accumulates in scalar_t and does not overflow. In addition, when the stride is 1, we are using warp shuffles to sum the gradient so the order of the addition is slightly different than a reference implementation which causes some minor numerical differences when compared to a reference.

TEST CODE:

```
# The first element is the number of iterations.
# The second represents the number of unique elements. If
# set to 0, the number of unique elements is equal to the
# number of elements.
# The remaining elements are the tensor dimensions.

basic_indexing_tests = [
    [10, 0, 12345],
    [10, 4, 12345],
    [10, 16, 512, 512, 32],
    [10, 0, 4, 4],
    [10, 0, 32, 32],
    [10, 8, 32, 32],
    [10, 8, 64, 32, 16],
    [10, 0, 64, 32, 16],
    [10, 16, 512, 512, 32],
    [10, 0, 675, 999, 13],
    [10, 0, 123, 456, 31],
    [10, 0, 512, 512, 32],
    [10, 4, 512, 512, 32],
    [10, 2, 512, 512, 32],
    [10, 0, 128, 128, 16, 16],
    [10, 8, 128, 126, 16, 16],
    [10, 4, 128, 126, 16, 16],
    [10, 0, 64, 64, 16, 16, 16],
    [10, 8, 64, 64, 16, 16, 16],
    [10, 2, 64, 64, 16, 16, 16],
    [10, 1, 64, 64, 16, 16, 16],
]

def run_basic_indexing_on_device(x, index, expected, device_string, iters):
    x_dev = x.to(device_string)
    x_dev = x_dev.detach().requires_grad_()
    index_dev = index.to(device_string)

    # Run backward pass; keep gradients and measure time
    torch.cuda.synchronize()
    t_bw_s = time()
    for _ in range(iters):
        y = x_dev[index_dev]
        z = y.sum()
        z.backward()
    torch.cuda.synchronize()
    t_bw_s = (time() - t_bw_s) / iters

    return (x_dev.grad, t_bw_s)

def run_basic_indexing_test(test_input):
    tensor_size = tuple(test_input[:5])
    niters = test_input[0]
    num_unique = test_input[1]
    tensor_size = tuple(test_input[2:])

    numel = 1
    for dim in tensor_size:
        numel *= dim
    if num_unique == 0:
        num_unique = numel

    index = torch.randint(0, num_unique, tensor_size, dtype=torch.long, device="cpu")
    x = torch.randn((numel,), dtype=torch.float32, device="cuda")

    index = index.detach()
    x = x.detach().requires_grad_()

    (cpu_grad, t_bw_cpu) = run_basic_indexing_on_device(x, index, numel / 2, "cpu", 1)
    (gpu_grad, t_bw_gpu) = run_basic_indexing_on_device(x, index, numel / 2, "cuda", 1)

    max_delta = torch.max(torch.abs(cpu_grad - gpu_grad.to("cpu")))
    missmatches = torch.nonzero(torch.abs(cpu_grad - gpu_grad.to("cpu")))

    (gpu_grad_perf, t_gpu) = run_basic_indexing_on_device(
        x, index, numel / 2, "cuda", niters
    )

    print(
        "test = {}, delta = {:.5f}, missmatches = {} duration_ms = {:.3f}".format(
            tuple(test_input), max_delta, missmatches, t_gpu * 1000.0
        )
    )

    if torch.numel(missmatches) > 0:
        print("cpu grad = {}", cpu_grad[missmatches])
        print("gpu grad = {}", gpu_grad[missmatches])
```

RESULTS:

```
Default Implementation

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.726
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.867
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.514
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.689
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.547
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.537
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.199
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.584
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.055
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.411
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.419
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.048
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.633
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 606.403
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4.099
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 76.813
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.760
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.547
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 317.583
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1204.800
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2412.133

Small Stride Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.904
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.156
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 308.878
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.566
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.540
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.550
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.868
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.656
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.856
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.624
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.837
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.274
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1127.040
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2123.942
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.282
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 288.997
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 547.267
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 12.844
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1178.934
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4262.042
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8172.318

Stride 1 Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.692
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.834
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 81.023
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.631
test = (100, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.491
test = (100, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.477
test = (50, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.561
test = (50, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.516
test = (16, 10, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 126.455
test = (10, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.238
test = (10, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.520
test = (10, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 7.854
test = (10, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 306.327
test = (10, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 610.498
test = (5, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.684
test = (5, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 75.604
test = (5, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.679
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.525
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 315.095
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1214.715
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100505
Approved by: https://github.com/ngimel
2023-05-03 23:52:58 +00:00
vfdev-5
6a12f10b08 Publicly exposing torch.backends.cpu.get_cpu_capability() (#100164)
Description:

- As suggested by Nikita, created `torch.backends.cpu` submodule and exposed `get_cpu_capability`.

- In torchvision Resize method we want to know current cpu capability in order to pick appropriate codepath depending on cpu capablities

Newly coded vectorized resize of uint8 images on AVX2 supported CPUs is now faster than older way (uint8->float->resize->uint8). However, on non-avx hardware (e.g. Mac M1) certain configs are slower using native uint8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100164
Approved by: https://github.com/albanD, https://github.com/malfet
2023-05-03 19:02:07 +00:00
PyTorch MergeBot
1114673c90 Revert "[pytorch] Accelerate indexing_backward_kernel with duplicates (#99441)"
This reverts commit 97afbcbc80.

Reverted https://github.com/pytorch/pytorch/pull/99441 on behalf of https://github.com/ngimel due to breaks ROCM ([comment](https://github.com/pytorch/pytorch/pull/99441#issuecomment-1531804487))
2023-05-02 16:46:04 +00:00
Lu Fang
090ec55f8d
Only skip in torch inductor test
Differential Revision: D45464303nnPull Request resolved: https://github.com/pytorch/pytorch/pull/100435
2023-05-01 22:21:37 -07:00
Lu Fang
429155b3c8
Disable some check to get the test pass
Differential Revision: D45437730nnPull Request resolved: https://github.com/pytorch/pytorch/pull/100364
2023-05-01 16:28:12 -07:00
valentinandrei
97afbcbc80 [pytorch] Accelerate indexing_backward_kernel with duplicates (#99441)
By knowing the stride value ahead of time, we can simplify the kernel code as follows:

If `stride == 1` we can use the whole warp to reduce the gradients
If `stride < warp_size` we don't need the internal `while (start_feature < stride)` loop as `blockDim.x` is always 32

This changes improve the performance of the kernel when duplicates are present and do not affect the performance with low amount of duplicates. The implementation is deterministic.

The proposed implementation uses `opmath_t` to accumulate in registers the gradient values so when using FP16/BF16 it may overflow if the number of elements is large. This is different from the initial implementation who accumulates in `scalar_t` and does not overflow. In addition, when the stride is 1, we are using warp shuffles to sum the gradient so the order of the addition is slightly different than a reference implementation which causes some minor numerical differences when compared to a reference.

TEST CODE:

```
# The first element is the number of iterations.
# The second represents the number of unique elements. If
# set to 0, the number of unique elements is equal to the
# number of elements.
# The remaining elements are the tensor dimensions.

basic_indexing_tests = [
    [10, 0, 12345],
    [10, 4, 12345],
    [10, 16, 512, 512, 32],
    [10, 0, 4, 4],
    [10, 0, 32, 32],
    [10, 8, 32, 32],
    [10, 8, 64, 32, 16],
    [10, 0, 64, 32, 16],
    [10, 16, 512, 512, 32],
    [10, 0, 675, 999, 13],
    [10, 0, 123, 456, 31],
    [10, 0, 512, 512, 32],
    [10, 4, 512, 512, 32],
    [10, 2, 512, 512, 32],
    [10, 0, 128, 128, 16, 16],
    [10, 8, 128, 126, 16, 16],
    [10, 4, 128, 126, 16, 16],
    [10, 0, 64, 64, 16, 16, 16],
    [10, 8, 64, 64, 16, 16, 16],
    [10, 2, 64, 64, 16, 16, 16],
    [10, 1, 64, 64, 16, 16, 16],
]

def run_basic_indexing_on_device(x, index, expected, device_string, iters):
    x_dev = x.to(device_string)
    x_dev = x_dev.detach().requires_grad_()
    index_dev = index.to(device_string)

    # Run backward pass; keep gradients and measure time
    torch.cuda.synchronize()
    t_bw_s = time()
    for _ in range(iters):
        y = x_dev[index_dev]
        z = y.sum()
        z.backward()
    torch.cuda.synchronize()
    t_bw_s = (time() - t_bw_s) / iters

    return (x_dev.grad, t_bw_s)

def run_basic_indexing_test(test_input):
    tensor_size = tuple(test_input[:5])
    niters = test_input[0]
    num_unique = test_input[1]
    tensor_size = tuple(test_input[2:])

    numel = 1
    for dim in tensor_size:
        numel *= dim
    if num_unique == 0:
        num_unique = numel

    index = torch.randint(0, num_unique, tensor_size, dtype=torch.long, device="cpu")
    x = torch.randn((numel,), dtype=torch.float32, device="cuda")

    index = index.detach()
    x = x.detach().requires_grad_()

    (cpu_grad, t_bw_cpu) = run_basic_indexing_on_device(x, index, numel / 2, "cpu", 1)
    (gpu_grad, t_bw_gpu) = run_basic_indexing_on_device(x, index, numel / 2, "cuda", 1)

    max_delta = torch.max(torch.abs(cpu_grad - gpu_grad.to("cpu")))
    missmatches = torch.nonzero(torch.abs(cpu_grad - gpu_grad.to("cpu")))

    (gpu_grad_perf, t_gpu) = run_basic_indexing_on_device(
        x, index, numel / 2, "cuda", niters
    )

    print(
        "test = {}, delta = {:.5f}, missmatches = {} duration_ms = {:.3f}".format(
            tuple(test_input), max_delta, missmatches, t_gpu * 1000.0
        )
    )

    if torch.numel(missmatches) > 0:
        print("cpu grad = {}", cpu_grad[missmatches])
        print("gpu grad = {}", gpu_grad[missmatches])
```

RESULTS:

```
Default Implementation

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.726
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.867
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.514
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.689
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.547
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.537
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.199
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.584
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 80.055
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.411
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.419
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.048
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.633
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 606.403
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4.099
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 76.813
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.760
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.547
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 317.583
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1204.800
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2412.133

Small Stride Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.904
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.156
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 308.878
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.566
test = (1, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.540
test = (1, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.550
test = (1, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2.868
test = (1, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.656
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 307.856
test = (1, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.624
test = (1, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.837
test = (1, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 6.274
test = (1, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1127.040
test = (1, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 2123.942
test = (1, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.282
test = (1, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 288.997
test = (1, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 547.267
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 12.844
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1178.934
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 4262.042
test = (1, 1, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8172.318

Stride 1 Kernel Version

test = (1, 0, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.692
test = (1, 4, 12345), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.834
test = (1, 16, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 81.023
test = (1, 0, 4, 4), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.631
test = (100, 0, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.491
test = (100, 8, 32, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.477
test = (50, 8, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.561
test = (50, 0, 64, 32, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 0.516
test = (16, 10, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 126.455
test = (10, 0, 675, 999, 13), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 8.238
test = (10, 0, 123, 456, 31), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1.520
test = (10, 0, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 7.854
test = (10, 4, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 306.327
test = (10, 2, 512, 512, 32), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 610.498
test = (5, 0, 128, 128, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 3.684
test = (5, 8, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 75.604
test = (5, 4, 128, 126, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 148.679
test = (1, 0, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 16.525
test = (1, 8, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 315.095
test = (1, 2, 64, 64, 16, 16, 16), delta = 0.00000, missmatches = tensor([], size=(0, 1), dtype=torch.int64) duration_ms = 1214.715
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99441
Approved by: https://github.com/ngimel
2023-05-01 22:41:00 +00:00
Lu Fang
d7fa7fa8cf
Introduce fast path in the CPU equal op
Differential Revision: D45282119nnPull Request resolved: https://github.com/pytorch/pytorch/pull/100024
2023-04-28 16:00:17 -07:00
kshitij12345
61dffa61c3 [fix] masked_scatter_: non-contiguous self (#100232)
Fixes https://github.com/pytorch/pytorch/issues/99638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/100232
Approved by: https://github.com/ngimel
2023-04-28 18:12:23 +00:00
dujinhang
9cd48b0575 Add warning information for dtypetensor. (#99521)
Fixes #ISSUE_NUMBER

Without affecting the existing cpu/cuda logic, a separate interface is provided for the custom backend and users can choose whether to use the interface function which provides 10 tensor types with custom backend variations.

Therefore, users can use torch.set_deafult_tensor_type to set the default device tensor type, or use torch.xxx.dtypetensor to create a tensor.For example,torch.set_deafult_tensor_type(torch.foo.DoubleTensor) or torch.foo.DoubleTensor([]).

@albanD , please review my changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99521
Approved by: https://github.com/albanD
2023-04-28 18:01:45 +00:00
Larry Liu
687afeb686 [dynamo][numpy] Add NumpyTensorVariable to translate ndarray attribute calls to tensor attributes (#95849)
Issue: #93684

# Problem

Reduce graph breaks when dynamo compiles python functions containing numpy functions and ndarray operations.

# Design (as I know it)

* Use torch_np.ndarray(a wrapper of tensor) to back a `VariableTracker`: `NumpyTensorVariable`.
* Translate all attributes and methods calls, on ndarray, to torch_np.ndarray equivalent.

This PR adds `NumpyTensorVariable` and supports:
1.  tensor to ndarray, ndarray to tensor
2. numpy functions such as numpy.meshgrid()
3. ndarray attributes such as `itemsize`, `stride`

Next PR will handle returning `np.ndarray` and add support for ndarray methods
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95849
Approved by: https://github.com/ezyang
2023-04-27 16:18:35 +00:00
Jiong Gong
e5c9a0fcf5 [dynamo] avoid graph break on repeat_interleave.self_int (#99528)
Address convit_base failure: https://github.com/pytorch/torchdynamo/issues/1886 mentioned in https://github.com/pytorch/pytorch/issues/93777
Also for models like EleutherAI/gpt-j-6B.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99528
Approved by: https://github.com/ezyang
2023-04-25 04:47:39 +00:00
BJ Hargrave
555ab310dc Add itemsize and nbytes properties to Tensor (#98322)
Adds properties for itemsize and nbytes to Tensor matching the properties in NumPy.

Fixes https://github.com/pytorch/pytorch/issues/12728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98322
Approved by: https://github.com/ezyang
2023-04-05 12:11:55 +00:00
Jason Ansel
b96fe9b61c Fix issues related to ClassInstantier in HF models (#97997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97997
Approved by: https://github.com/anijain2305
2023-04-04 00:01:08 +00:00
Jason Ansel
71d850a100 [inductor] Fallback on complex64 kernels (#98155)
Later PRs in this stack fixe graph breaks in GoogleFnet which triggers errors from inductor trying to compile torch.complex64, this fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/98155
Approved by: https://github.com/anijain2305, https://github.com/ngimel
2023-04-03 01:06:43 +00:00
Nikita Shulga
2af09393f9 masked_scatter should accept only bool masks (#97999)
Modify test_torch to check that assert is raised in this case

torch.uint8 usage has been deprecated for a few releases, and errors has been raised for other dtypes on CUDA device, but not on CPU.
This PR finally restricts mask to just `torch.bool`
See https://github.com/pytorch/pytorch/pull/96594 as an example doing it for `torch.masked_fill`

Fixes https://github.com/pytorch/pytorch/issues/94634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97999
Approved by: https://github.com/ngimel
2023-04-01 23:25:25 +00:00
Nikita Shulga
a1dc2b1774 [BE] Remove bool dtype from masked_scatter (#98015)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at a9fa438</samp>

Simplified a test function for `torch.masked_scatter` in `test/test_torch.py` by removing redundant and unnecessary code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98015
Approved by: https://github.com/ezyang
2023-03-31 01:45:57 +00:00
Aleksei Nikiforov
8289120ef0 Revert "test/test_torch.py: fix TestTorch::test_from_buffer test (#96952)" (#97759)
Tests were already fixed in https://github.com/pytorch/pytorch/pull/92834, and these changes instead of also fixing tests are now breaking them again.

This reverts commit 7f94ea8492.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97759
Approved by: https://github.com/janeyx99
2023-03-28 18:43:08 +00:00
Nikita Shulga
542fb0b1fa Specify file encoding in test_torch.py (#97628)
Attempt to fix
```
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 5260: ordinal not in range(128)
```
in https://github.com/pytorch/pytorch/actions/runs/4522628359/jobs/7965372405

In general, it's a good practice to explicitly specify encoding, as otherwise it depends on environment variable and makes tests failures unpredicatble

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97628
Approved by: https://github.com/dagitses, https://github.com/kit1980
2023-03-26 20:03:25 +00:00
Edward Z. Yang
37faa48844 DCE inference graphs too (#97275)
I added a bunch of asserts to verify that I didn't accidentally kill copy_ in the graph, hopefully this combined with our existing tests is good enough.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/97275
Approved by: https://github.com/bdhirsh
2023-03-23 07:02:52 +00:00
Kurt Mohler
fbc803df0c Only warn once for TypedStorage deprecation (#97379)
Fixes #97207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97379
Approved by: https://github.com/ezyang
2023-03-23 05:40:23 +00:00
Aleksei Nikiforov
7f94ea8492 test/test_torch.py: fix TestTorch::test_from_buffer test (#96952)
Use opposite encoding on big endian systems
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96952
Approved by: https://github.com/ezyang
2023-03-17 14:36:33 +00:00
mingfeima
06054d7df0 fix random output issue on index_select when src is scalar and index is empty (#96408)
Fix https://github.com/pytorch/pytorch/issues/94340
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96408
Approved by: https://github.com/ngimel
2023-03-16 05:30:45 +00:00
Kurt Mohler
06b7285163 Add torch._check* functions analogous to C++ TORCH_CHECK* (#88725)
Adds `_check`, `_check_index`, `_check_value`, `_check_type`, `_check_not_implemented`, `_check_tensor_all`

Part of #72948
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88725
Approved by: https://github.com/albanD
2023-03-14 20:44:50 +00:00
kvathupo
2b9d9bcb85 Deprecate non-bool masks in masked_fill (#96594)
__What?__
Per discussion at #94634, deprecate `masked_fill` with non-bool masks. Deprecation warnings were previously added by #22261, but not for Apple MPS. I can revert the MPS changes if deprecation warnings are wanted first tho. See also #96112.

Fixes #85063 and #89320.

__Further Development?__
- Fixed the mask dtype checking for the cuda dispatch for `masked_fill` in `aten/src/ATen/native/cuda/Indexing.cu`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96594
Approved by: https://github.com/malfet, https://github.com/ngimel
2023-03-13 01:41:47 +00:00
Nikita Shulga
1cd0929bf7 [BC] Allow only bool tensors as mask in masked_select (#96112)
`byte` support was marked as deprecated in 1.8, so it's fine to remove this in 2.1 (or even 2.0)
Deprecation warning was added by https://github.com/pytorch/pytorch/pull/22261

Also, fix bunch of syntactic errors in comments

Pull Request resolved: https://github.com/pytorch/pytorch/pull/96112
Approved by: https://github.com/ezyang
2023-03-07 01:43:14 +00:00
puririshi98
8aa34602f7 Jetson Update for CI Redo (#94549)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94549
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-02-21 17:13:38 +00:00
Yuxin Wu
9bb2fe3eae fix numpy1.24 deprecations in unittests (#93997)
Fixes https://github.com/pytorch/pytorch/issues/91329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93997
Approved by: https://github.com/ngimel, https://github.com/jerryzh168
2023-02-18 00:59:09 +00:00
Xuehai Pan
b005ec62b9 [BE] Remove dependency on six and future (#94709)
Remove the Python 2 and 3 compatibility library [six](https://pypi.org/project/six) and [future](https://pypi.org/project/future) and `torch._six`. We only support Python 3.8+ now. It's time to retire them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94709
Approved by: https://github.com/malfet, https://github.com/Skylion007
2023-02-14 09:14:14 +00:00
Brian Hirsh
ceb0f1576b turn functionalization on in aot_autograd inference (#92857)
still waiting for CI fallout
fixes #90759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92857
Approved by: https://github.com/ezyang
2023-02-13 17:48:00 +00:00
Nikita Shulga
4869929f32 Update Triton hash (#94249)
That includes MLIR + latest packaging changes (that also download ptxas from CUDA-12)
Tweak CI to install gcc-9 to build trition

Disable a few tests to make everything be correct

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94249
Approved by: https://github.com/Skylion007, https://github.com/ngimel, https://github.com/weiwangmeta
2023-02-13 13:17:36 +00:00
Aaron Gokaslan
9171f7d4cd [BE] Modernize PyTorch even more for 3.8 with pyupgrade (#94520)
Applies some more pyupgrade fixits to PyTorch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/94520
Approved by: https://github.com/ezyang
2023-02-10 18:02:50 +00:00
ganler
db6cfff827 fix: forbid multi-index for index_select over scalar (#94347)
Fixes #88940

According to the [doc](https://pytorch.org/docs/stable/generated/torch.index_select.html):
1. "The returned tensor has the same number of dimensions as the original tensor (`input`). "
2.  "The `dim`th dimension has the same size as the length of `index`; other dimensions have the same size as in the original tensor."

These two conditions cannot be satisfied at the same time if the `input` is a scalar && `index` has multiple values: because a scalar at most holds one element (according to property 1, the output is a scalar), it is impossible to satisfy "The `dim`th dimension has the same size as the length of `index`" when `index` has multiple values.

However, currently, if we do so we either get:

1. Buffer overflow with ASAN;
2. Or (w/o ASAN) silently returns outputs that is not consistent with the doc (`x.index_select(0, torch.Tensor([0, 0, 0]).int())` returns `x`).

As a result, we should explicitly reject such cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/94347
Approved by: https://github.com/malfet
2023-02-10 17:17:09 +00:00
min-jean-cho
900e09c872 [Dynamo] Support torch.Tensor.fn as TorchVariable, not UserDefinedObjectVariable, preventing graph break (#93243)
As found in #92709, thanks to @ngimel and @jansel, currently `torch.Tensor.fn` points to `UserDefinedObjectVariable` rather than `TorchVariable`. The root cause is due to https://github.com/pytorch/pytorch/pull/92709#pullrequestreview-1273357406. To prevent this, build `TorchVariable`  of `torch.Tensor.fn` pointing to `torch.ops.aten.fn`.

This issue propagates to `torch.Tensor.fn` causing graph break with `nopython=True`.
```python
import torch
import torch._dynamo as dynamo

#op = torch.ops.aten.abs_ # no graph break
op = torch.Tensor.abs_ # graph break
args = torch.empty(10)

def foo(args):
    return op(args)

opt_foo = dynamo.optimize("inductor", nopython=True)(foo)
y_ = opt_foo(args)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93243
Approved by: https://github.com/jansel
2023-02-07 09:26:50 +00:00
min-jean-cho
6e1cfcdf4b cauchy_ few fixes (1) check gamma > 0 (2) better dtype error log (#93314)
Related #92047

(1) `torch.Tensor.cauchy_` is missing check for `gamma > 0` (`torch.distributions.cauchy.Cauchy` correctly checks  `gamma > 0`).
(2) add better error log on dtype similar to exponential_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93314
Approved by: https://github.com/jgong5, https://github.com/fritzo, https://github.com/lezcano
2023-02-03 11:56:28 +00:00
min-jean-cho
2f0b0c5dd7 exponential_ few fixes (1) lambda > 0 (2) mkl kernel to continuous (3) better error log on dtype (#92891)
Exponential distribution is continuous. Fixes CPU MKL exponential implementation to exclude integer dtypes.

```python
import torch
dtypes = [torch.uint8, torch.int8, torch.int16, torch.int32, torch.int64]

for dtype in dtypes:
    x = torch.empty(10000, dtype=dtype).exponential_() # should fail !
    print("dtype: ", x.dtype, "sum: ", x.sum())
```

### Additional Context

Related to #92709. This issue propagates to OpInfo of exponential.

```
AssertionError: The supported dtypes for exponential on device type cpu are incorrect!
The following dtypes worked in forward but are not listed by the OpInfo: {torch.int64, torch.uint8, torch.int8, torch.int16, torch.int32}.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92891
Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/ngimel
2023-01-28 02:27:16 +00:00
Yanbo Liang
a6b51448f5 [Dynamo] Supports if condition on user defined object (#90892)
Fixes Meta internal user case, see the pattern in unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90892
Approved by: https://github.com/jansel, https://github.com/mlazos
2023-01-26 04:19:32 +00:00
PyTorch MergeBot
9b23fd378f Revert "Logcumsumexp for complex in CPU and CUDA (#90847)"
This reverts commit 64985123e4.

Reverted https://github.com/pytorch/pytorch/pull/90847 on behalf of https://github.com/malfet due to Reverting to decrease build time, let's discuss the alternatives here
2023-01-24 20:49:08 +00:00
pierreHaslee
1c30844eaa where() function added as a Tensor method as well (#92849)
Fixes #88470

I added the "method" keyword in `aten/src/ATen/native/native_functions.yaml` for the function `where` with Scalar Overload.
This way, you can now use `Tensor.where()` with a scalar parameter the same way `torch.where()` can.

I added a test in `test/test_torch.py` as requested.
It uses the `where()` method on a tensor and then checks it has the same results as the `torch.where()` function.
The test is roughly the same as the one provided by the author of the issue.

PS: this is the second PR I make to resolve this issue, the first one is #92747. I had troubles with commit signatures and is therefore closed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92849
Approved by: https://github.com/albanD
2023-01-24 03:09:33 +00:00
mfkasim1
64985123e4 Logcumsumexp for complex in CPU and CUDA (#90847)
Another PR towards solving #89205.
What's in this PR:

* The implementation of forward `logcumsumexp` for complex numbers in CPU & CUDA
* The tests on forward call of `logcumsumexp` for complex numbers
* The implementation of backward `logcumsumexp` for complex numbers

What's missing:

* The test on backward gradient of `logcumsumexp` (it complaints `RuntimeError: logcumsumexp does not support automatic differentiation for outputs with complex dtype.` and I don't know how to solve the error and I don't know where to put the test for the backward computation). If possible, I'd like this to be done in this PR.

It's really tricky to handle the edge cases here (i.e. the ones involving `inf`), but I've tried my best to put some comments explaining the reasonings of my decisions in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90847
Approved by: https://github.com/albanD
2023-01-20 15:10:50 +00:00
Kurt Mohler
647b8f8e3e Add TORCH_CHECK_TENSOR_ALL (#89097)
`TORCH_CHECK_TENSOR_ALL(cond, ...)` is a wrapper around `TORCH_CHECK` which allows the condition argument to be a tensor, batched or unbatched. `cond` can be a boolean tensor of any size. If any element is False, or if `cond.numel() == 0`, then `TORCH_CHECK_TENSOR_ALL` raises an error

Part of #72948
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89097
Approved by: https://github.com/zou3519
2023-01-19 21:04:09 +00:00
BowenBao
a72bcb3388 Do not leak SkipFrame exception to parent frames (#91059)
Discovered by https://github.com/pytorch/torchdynamo/issues/2000, we noticed the exception `SkipFrame` to avoid repeatedly compiling frame of loop with graph breaks could leak to parent frames while inlining, which then prevents compiling.

This PR checks at inlining if such exception is raised and would instead raise an `Unsupported` to the outer frame. The original behavior and goal of #88857 is unaffected: the inner frame that has loop would still be skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91059
Approved by: https://github.com/jansel, https://github.com/thiagocrepaldi
2023-01-13 17:11:22 +00:00
XiaobingSuper
1892c75a45 fix norrow_copy correctness issue for non-contiguous input for cpu path(reland) (#91883)
This PR is about re-land https://github.com/pytorch/pytorch/pull/91789.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91883
Approved by: https://github.com/lezcano
2023-01-10 10:56:18 +00:00
PyTorch MergeBot
d85f3c8237 Revert "fix norrow_copy correctness issue for non-contiguous input for cpu path (#91789)"
This reverts commit 136dadd689.

Reverted https://github.com/pytorch/pytorch/pull/91789 on behalf of https://github.com/huydhn due to This breaks trunk with XPASS test_vmap_exhaustive_narrow_copy_cpu_float32 136dadd689
2023-01-09 06:50:20 +00:00
XiaobingSuper
136dadd689 fix norrow_copy correctness issue for non-contiguous input for cpu path (#91789)
Fix https://github.com/pytorch/pytorch/issues/91690.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/91789
Approved by: https://github.com/jgong5, https://github.com/lezcano
2023-01-09 00:55:03 +00:00
PyTorch MergeBot
b3603f8129 Revert "Deduplicate c10 error and PyTorchError hierarchy (#87855)"
This reverts commit 34f2d3e6ae.

Reverted https://github.com/pytorch/pytorch/pull/87855 on behalf of https://github.com/osalpekar due to perf regression in quantization tests
2023-01-06 19:56:35 +00:00
William Phetsinorath
34f2d3e6ae Deduplicate c10 error and PyTorchError hierarchy (#87855)
Fixes #53370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87855
Approved by: https://github.com/albanD
2023-01-02 15:53:36 +00:00
ecao
274d3b24c3 use scatter_add for index_add when dim is the most inner dim (#88729)
### Motivation
When dim is -1 and the slice of source or result is noncontiguous, original `index_add` is slow as it uses add for the sliced tensor, which is serial on index and parallel on sliced tensor to avoid write conflict. Doing parallel on the sliced tensor is not optimal as the size of sliced tensor may be not big enough to parallel and also causes multiple parallelizations.

`scatter_add ` is used to speedup for this case as `scatter_add ` parallels on the outer dimension of input and is serial on the inner dimension to avoid write conflict. `scatter_add ` only need one parallel and the size of outer dimensions is bigger to do parallel.

### Testing

- Single core:

Before:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.82E-03 | 2.11E-03
[10, 128, 50, 50] | 0.023604 | 0.023794

After:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 9.30E-04 | 1.66E-03
[10, 128, 50, 50] | 0.005995 | 0.010003

- Single socket (28 cores):

Before:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 2.96E-03 | 2.52E-03
[10, 128, 50, 50] | 0.012208 | 0.012568

After:

shape | fp32 / s | bf16 / s
-- | -- | --
[10, 128, 20, 20] | 7.44E-05 | 1.33E-04
[10, 128, 50, 50] | 0.000333 | 0.000469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88729
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2022-12-28 12:04:17 +00:00