Commit Graph

54743 Commits

Author SHA1 Message Date
Michael Voznesensky
3b9a386d48 Add TORCH_FAKE_TENSOR_DEBUG and use it to enable storage of traces on fake tensors at init time (#90215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90215
Approved by: https://github.com/ezyang
2022-12-06 22:28:52 +00:00
William Wen
d224ac7f77 Remove logging.CODE (#90234)
Fixes https://github.com/pytorch/torchdynamo/issues/1932

Discussed with @mlazos: if we still want separate streams for code logging and the rest of the info logging, we can use a separate logger object with a unique name.
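A minimal sketch of that approach using the standard `logging` module; the logger names below are illustrative, not the ones dynamo actually uses:

```
import logging

# Illustrative names only: a dedicated logger for generated-code output,
# kept separate from the general info stream by giving it a unique name.
code_log = logging.getLogger("dynamo.code")
info_log = logging.getLogger("dynamo")

code_log.setLevel(logging.DEBUG)
code_log.propagate = False                       # keep the two streams separate
code_log.addHandler(logging.FileHandler("generated_code.log"))

info_log.setLevel(logging.INFO)
info_log.addHandler(logging.StreamHandler())

code_log.debug("generated code goes to its own file")
info_log.info("everything else goes to the default stream")
```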

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90234
Approved by: https://github.com/ezyang
2022-12-06 22:24:43 +00:00
Sergii Dymchenko
14894a7311 Remove non-existent parameter from docstring (#90163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90163
Approved by: https://github.com/clee2000
2022-12-06 22:22:17 +00:00
Yanbo Liang
7e9a8a1361 Disable dynamo tracing torchrec.distributed (#90087)
Summary: Context at T138318923

Test Plan: manual test

Reviewed By: yf225

Differential Revision: D41631076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90087
Approved by: https://github.com/yf225
2022-12-06 22:17:16 +00:00
Eli Uriegas
27ad2605c8 Hotfix to unblock TRT unit tests internally (#90313)
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>

Export of [D41778303](https://www.internalfb.com/diff/D41778303)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90313
Approved by: https://github.com/ezyang, https://github.com/malfet
2022-12-06 22:14:37 +00:00
eqy
62e450d55f [CUDA Graphs] Add option to dump a captured graph for debugging (#85519)
CC @xwang233 @ptrblck @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85519
Approved by: https://github.com/ngimel
2022-12-06 22:03:05 +00:00
fduwjj
1abe264ef0 [Upstream _NamedOptimizer] Reland PR (89480) (#90293)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Reland https://github.com/pytorch/pytorch/pull/89480/
* #90294
* __->__ #90293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90293
Approved by: https://github.com/awgu
2022-12-06 21:47:12 +00:00
Andrew Gu
7436b19eb2 [FSDP] Clarify loss dtype check in _test_fsdp_parity (#90251)
A recent PR deprecated `torch.testing.assert_allclose` in favor of `torch.testing.assert_close` and left a `TODO`. This PR follows up to confirm that we do intend to have `check_dtype=False`.
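For reference, a minimal sketch of what `check_dtype=False` allows here, comparing a low-precision result against a full-precision one:

```
import torch

ref = torch.tensor([0.5, 1.5], dtype=torch.float32)
out = torch.tensor([0.5, 1.5], dtype=torch.float16)

# With check_dtype=False the values are compared after dtype promotion,
# so the fp16/fp32 mismatch is allowed.
torch.testing.assert_close(out, ref, check_dtype=False)

# The default (check_dtype=True) would raise because the dtypes differ.
```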
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90251
Approved by: https://github.com/rohan-varma
2022-12-06 21:28:40 +00:00
Andrew Gu
919e09f26a [FSDP][BE] Clean up dead code from clip_grad_norm_() testing (#90250)
`FSDP.clip_grad_norm_()` is tested separately in `test_fsdp_clip_grad_norm.py`. This PR removes the dead non-run code from `common_fsdp.py` and `test_fsdp_core.py` related to `clip_grad_norm_()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90250
Approved by: https://github.com/rohan-varma
2022-12-06 21:28:40 +00:00
Andrew Gu
3b578edd04 [FSDP] Test use_orig_params=True in test_fsdp_ignored_modules.py (#90290)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90290
Approved by: https://github.com/zhaojuanmao
2022-12-06 21:28:40 +00:00
Yanbo Liang
25f39c1bce Fix uniform ref implementation (#90094)
Fixes https://github.com/pytorch/torchdynamo/issues/1954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90094
Approved by: https://github.com/ngimel
2022-12-06 21:28:17 +00:00
Edward Z. Yang
a1ab06ab65 ShapeEnv.create_symbolic_sizes_strides_storage_offset (#89962)
Instead of having the storage offset hang out on its own, allocate
all of these symbols in one go.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89962
Approved by: https://github.com/albanD, https://github.com/voznesenskym
2022-12-06 21:27:02 +00:00
Charlie Yan
e818c36647 reland #89222: [Composable API] replicate: change to per module call, remove mark_root_module() (#90254)
reland https://github.com/pytorch/pytorch/pull/89222
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90254
Approved by: https://github.com/zhaojuanmao
2022-12-06 21:17:53 +00:00
Andrew Gu
bd9ad89a6d [FSDP] Fix accidental change in _test_fsdp_parity (#90252)
I accidentally changed the semantics of this line when refactoring a while ago. The [previous version](https://github.com/pytorch/pytorch/pull/80873/files#diff-7b5c66f99161fa6a3d9042e80f8c8cc140a64e43445feede46f55e53154f6c3dL635) used to say:
```
if not mixed_precision:
```
which is actually the opposite of
```
if mixed_precision is not None:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90252
Approved by: https://github.com/zhaojuanmao
2022-12-06 20:13:21 +00:00
mfkasim1
ce21262808 Log1p for complex in CPU (#89691)
Another PR for https://github.com/pytorch/pytorch/issues/89205: making torch.log1p accept complex numbers on CPU.
I haven't done the GPU version because I'm not sure which file(s) to change.
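An illustrative check of the new CPU behavior (values chosen arbitrarily):

```
import torch

z = torch.tensor([0.5 + 0.5j, -0.25 + 1.0j])   # complex64 CPU tensor
out = torch.log1p(z)                            # elementwise log(1 + z)
ref = torch.log(1 + z)                          # naive reference, less accurate near z = 0
print(torch.allclose(out, ref))                 # should print True for these inputs
```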

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89691
Approved by: https://github.com/jgong5, https://github.com/lezcano
2022-12-06 19:12:24 +00:00
Wanchao Liang
9e314bd822 [dtensor] handle the case where output of op is Optional[Tensor] (#90241)
As observed by @aazzolini, some ops may have Optional[Tensor] returns
where they return None (e.g. native_layer_norm_backward). This is a mismatch
between the C++ aten op signature and Python's None, so we need to handle it
on the Python side.
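A hypothetical sketch of that Python-side handling; the names below are illustrative, not the actual DTensor internals:

```
from typing import Callable, Optional, Sequence

import torch

# When the C++ schema declares Optional[Tensor] outputs, a None entry must be
# passed through as-is instead of being re-wrapped into a DTensor.
def wrap_outputs(
    outputs: Sequence[Optional[torch.Tensor]],
    wrap: Callable[[torch.Tensor], object],
) -> tuple:
    return tuple(None if out is None else wrap(out) for out in outputs)
```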
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90241
Approved by: https://github.com/aazzolini
2022-12-06 18:17:20 +00:00
Edward Z. Yang
eace084815 Use Sized not Iterable to test for len (#90182)
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
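A small illustration of why `Sized` is the right check before calling `len`:

```
from collections.abc import Iterable, Sized

gen = (i for i in range(3))   # Iterable, but has no __len__
lst = [1, 2, 3]               # both Iterable and Sized

print(isinstance(gen, Iterable), isinstance(gen, Sized))  # True False
print(isinstance(lst, Sized))                             # True
# len(gen) would raise TypeError, so testing for Sized avoids that failure.
```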

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90182
Approved by: https://github.com/albanD
2022-12-06 16:13:14 +00:00
mingfeima
c6942dbbfb add shape check for random_samples in fractional_max_pool{2d|3d} (#89992)
This PR adds shape checks for `random_samples` in fractional_max_pool2d and fractional_max_pool3d
to provide more meaningful errors instead of a segfault when the input is invalid.

For more details, please check https://github.com/pytorch/pytorch/issues/89648
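A sketch of the kind of input being validated, using the documented 2d convention where `_random_samples` has shape `(N, C, 2)`; the exact error text is defined by the PR, not shown here:

```
import torch
import torch.nn.functional as F

x = torch.randn(2, 4, 16, 16)           # (N, C, H, W)

# Well-formed _random_samples for the 2d case: one (h, w) sample pair per
# (batch, channel), i.e. shape (N, C, 2) with values in [0, 1).
samples = torch.rand(2, 4, 2)
out = F.fractional_max_pool2d(x, kernel_size=3, output_size=(8, 8),
                              _random_samples=samples)
print(out.shape)                         # torch.Size([2, 4, 8, 8])

# A mismatched shape, e.g. torch.rand(1, 1, 2), should now raise a clear
# error instead of segfaulting.
```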
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89992
Approved by: https://github.com/jgong5, https://github.com/ezyang
2022-12-06 14:14:41 +00:00
mikey dagitses
be5108d5f9 replace memset with value-initialization (#90048)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed with [ReviewStack](https://reviewstack.dev/pytorch/pytorch/pull/90048).
* #89865
* #89852
* #89851
* __->__ #90048

replace memset with value-initialization

Summary:
This is equivalent to zero initialization for any members that are
scalar or have implicit default constructors.

Note that aside from the reset at the beginning, blockmask and
philox_args are not touched by this function.

Test Plan: Rely on CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90048
Approved by: https://github.com/drisspg, https://github.com/malfet
2022-12-06 13:48:05 +00:00
Xia, Weiwen
97e47a52b8 [Quant] Add fused linear-leaky_relu op for onednn backend (#88478)
**Summary**
Post op fusion can reduce data movement overhead and improve inference performance. This PR adds a fused `linear-leaky_relu` op for the `onednn` backend, which will be used for int8 inference with the `onednn` backend. This op cannot be called with other quantization backends; otherwise an error is thrown.
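For context, a minimal sketch of the float pattern that the fused int8 op targets; the quantization flow and the actual fused kernel are provided by the `onednn` backend and are not shown here:

```
import torch
import torch.nn as nn

# The eager-mode pattern that linear-leaky_relu fusion targets.
class LinearLeakyReLU(nn.Module):
    def __init__(self, in_features, out_features, negative_slope=0.01):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.act = nn.LeakyReLU(negative_slope)

    def forward(self, x):
        return self.act(self.linear(x))

m = LinearLeakyReLU(8, 4)
print(m(torch.randn(2, 8)).shape)   # torch.Size([2, 4])
```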

**Test Plan**
python test_quantization.py TestQuantizedLinear

Pull Request resolved: https://github.com/pytorch/pytorch/pull/88478
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2022-12-06 08:32:59 +00:00
AllenTiTaiWang
41bfa49db9 [ONNX] Add src/index dynamic axes support for aten::scatter_add (#90090)
Extending #89787 and following the answer in https://github.com/onnx/onnx/issues/4672, dynamically capturing the shape of index lets the converter further support this op.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90090
Approved by: https://github.com/BowenBao
2022-12-06 07:56:20 +00:00
PyTorch MergeBot
176b962f4b Revert "[PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480)"
This reverts commit 31ec1a1ef7.

Reverted https://github.com/pytorch/pytorch/pull/89480 on behalf of https://github.com/kit1980 due to Broke test_correct_module_names
2022-12-06 07:22:37 +00:00
Ryan Spring
3c9431f505 Add factory functions to python frontend (#89230)
- Add a `full` nvprim to support factory functions, because the `full` reference uses `empty` and `fill` while we have a `full` factory function.
- Change the `full_like` reference to call `full` to avoid defining another nvprim.
- Enable support for `new_zeros` so that the `cudnn_batch_norm` decomposition works.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89230
Approved by: https://github.com/kevinstephano, https://github.com/mruberry
2022-12-06 07:16:21 +00:00
PyTorch MergeBot
e645771e95 Revert "as_strided: Fix default storage_offset for reference implementation (#89513)"
This reverts commit ba70a8be03.

Reverted https://github.com/pytorch/pytorch/pull/89513 on behalf of https://github.com/kit1980 due to Broke multiple workflows, 2 unexpected successes for autograd tests
2022-12-06 07:14:16 +00:00
Arek Sredzki
44dac51c36 Improve Autograd Documentation Clarity (#89401)
This makes minor adjustments to the autograd docs, improving clarity and resolving grammatical errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89401
Approved by: https://github.com/kit1980
2022-12-06 06:45:04 +00:00
Manuel Candales
49ccc41d57 [Vulkan] Enable QInt8 and QInt32 quantization (#89788)
Summary: Enabled Vulkan quantization for dtypes QInt8 and QInt32

Test Plan:
On Mac
```
cd ~/fbsource
buck1 run -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAppleMac\#macosx-arm64
```

On Android
```
cd ~/fbsource
buck1 build -c ndk.custom_libcxx=false -c pt.enable_qpl=0 -c pt.vulkan_full_precision=1 //xplat/caffe2:pt_vulkan_quantized_api_test_binAndroid\#android-arm64 --show-output
adb push buck-out/gen/xplat/caffe2/pt_vulkan_quantized_api_test_binAndroid\#android-arm64 /data/local/tmp/vulkan_quantized_api_test
adb shell "/data/local/tmp/vulkan_quantized_api_test"
```

Differential Revision: D41561661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89788
Approved by: https://github.com/digantdesai
2022-12-06 06:27:40 +00:00
Andrew Gu
45b40be078 [FSDP()] Fix fully_shard fwd hook registration (#90201)
I need to rebase later after Shen's PRs land.

The idea is to only register the pre/post-forward hook on the _root modules_ among the modules that consume a `FlatParameter`. (Yes, the term _root module_ is heavily overloaded. We may want to clarify that at some point. Here, _root_ is being used in the graph sense, meaning parent-less, and the scope is only among the modules consuming a `FlatParameter`.)

This avoids unnecessary pre/post-forward hooks running, which would lead to errors because the unshard is not truly idempotent.
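A hypothetical sketch of the "root among the `FlatParameter` consumers" selection; `roots_among` and `parent_of` are made-up names for illustration:

```
# Keep only the consumer modules that have no ancestor also in the consumer
# set; hooks are registered on just these, avoiding redundant pre/post-forward
# hook runs.
def roots_among(consumers, parent_of):
    roots = []
    for module in consumers:
        parent = parent_of.get(module)
        while parent is not None and parent not in consumers:
            parent = parent_of.get(parent)
        if parent is None:           # no ancestor is also a consumer
            roots.append(module)
    return roots
```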
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90201
Approved by: https://github.com/mrshenli, https://github.com/rohan-varma
2022-12-06 06:09:03 +00:00
Sean Ross-Ross
2b7fcfa399 fix: Moving operators to FuncTorchBatchedDecomposition (#89762)
I've moved over some of the easy-to-move operators and removed an xfail.

I found this from the test that I implemented in https://github.com/pytorch/pytorch/pull/89465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89762
Approved by: https://github.com/zou3519
2022-12-06 05:59:47 +00:00
Sean Ross-Ross
bb673fb1d9 fix: update error when tensor escapes vmap (#89077)
Fixes https://github.com/pytorch/functorch/issues/1054

@zou3519, I played around with it, but I am unsure how to repro the cases for gen_vmap_inplace_plumbing and below in gen_vmap_plumbing_no_returns.

I've also seen that there are 24 other instances of the `TORCH_INTERNAL_ASSERT(maybe_layer.has_value());` assert; should I change all of these and add tests?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89077
Approved by: https://github.com/zou3519
2022-12-06 05:52:09 +00:00
Wanchao Liang
2c2cce73d4 [dtensor] remove torchgen function schema and parse manually (#90106)
This PR gets rid of the torchgen FunctionSchema parsing and parses the
schema manually. It should resolve the torchgen packaging issue and also
provide some perf wins when running DTensor eagerly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90106
Approved by: https://github.com/awgu
2022-12-06 05:45:00 +00:00
Yanli Zhao
a0c7b88861 remove backward hook in memory_tracker (#90143)
Remove the backward hook in memory_tracker, as it does not work well with jagged tensors in some cases. It is OK to remove this hook for now since it does not really track any stats.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90143
Approved by: https://github.com/rohan-varma
2022-12-06 05:39:59 +00:00
Sergii Dymchenko
6bbcd025bd Fix issue 38095 TODO in onnx/test_utility_funs.py (#90085)
Fix TODO related to https://github.com/pytorch/pytorch/issues/38095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90085
Approved by: https://github.com/BowenBao
2022-12-06 05:29:50 +00:00
Masaki Kozuki
508916128d [ReduceOp] ameliorate custom __eq__ (#90088)
Improve the completeness of `ReduceOp.__eq__`.

A follow-up should support the equality operator with a `RedOpType` as the first argument and a `ReduceOp` as the second.
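A generic sketch of that kind of `__eq__` in plain Python, not the actual c10d implementation:

```
from enum import Enum


class RedOpType(Enum):
    SUM = 0
    MAX = 1


class ReduceOp:
    """Toy wrapper used only to illustrate the comparison behavior."""

    def __init__(self, op):
        self.op = op

    def __eq__(self, other):
        if isinstance(other, ReduceOp):
            return self.op == other.op
        if isinstance(other, RedOpType):
            return self.op == other
        return NotImplemented


assert ReduceOp(RedOpType.SUM) == ReduceOp(RedOpType.SUM)
assert ReduceOp(RedOpType.SUM) == RedOpType.SUM
# RedOpType-on-the-left comparisons in the real bindings are left to the follow-up.
```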

Fixes #90072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90088
Approved by: https://github.com/kwen2501
2022-12-06 05:13:50 +00:00
Michael Lazos
2d9267ba30 [dynamo] Rewrite addcdiv in dynamo to its constituent ops (#90227)
This avoids a graph break when `value` is used, which fixes graph breaks in the variants of the Adam and Adagrad optimizers.
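A sketch of the decomposition being relied on, not the exact rewrite dynamo performs: `addcdiv(input, t1, t2, value=v)` is `input + v * t1 / t2`.

```
import torch

def addcdiv_decomposed(input, tensor1, tensor2, *, value=1):
    # The constituent-op form that can be traced without a graph break.
    return input + value * (tensor1 / tensor2)

x, a = torch.randn(4), torch.randn(4)
b = torch.rand(4) + 1.0          # keep the divisor away from zero
torch.testing.assert_close(
    addcdiv_decomposed(x, a, b, value=0.5),
    torch.addcdiv(x, a, b, value=0.5),
)
```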

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90227
Approved by: https://github.com/jansel
2022-12-06 05:08:44 +00:00
Ram Rachum
77f9b2e8bf Fix exception causes in fx, nn and onnx packages (#90134)
This is a continuation of #90118

@kit1980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90134
Approved by: https://github.com/kit1980
2022-12-06 04:34:58 +00:00
fduwjj
31ec1a1ef7 [PT-D][Composability][1/N] Upstream NamedOptimizer from TorchRec (KeyedOptimizer in TR) (#89480)
In PyTorch, the optimizer state_dict always uses numbers to index the per-parameter optimizer state.

The composability workstream now needs an FQN-based way to index the optimizer state_dict for parameters.

For example, SGD optimizer might have something in its `state_dict` like:

```
{'state':
  {0:
    {'momentum_buffer': tensor(...)},
  {1:
    {'momentum_buffer': tensor(...)},
  ...
}
'param_groups':
    [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1, 2, 3, 4, 5, 6, 7]}]
}
```

And in NamedOptimizer we want the `state_dict` can be:

```
{'state':
  {'net1.0.weight':
    {'momentum_buffer': tensor(...)},
  {'net1.0.bias':
    {'momentum_buffer': tensor(...)},
  ...
}
'param_groups':
    [{'lr': 0.001, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': ['net1.0.weight', 'net1.0.bias', 'net2.0.weight', 'net2.0.bias', 'net3.weight', 'net3.bias', 'net4.1.weight', 'net4.1.bias']}]
}
```

We also want to support load_state_dict to enable optim `state_dict` overrides for NamedOptimizer.

For the next couple of PRs/diffs, we also need to:
1. Make `NamedOptimizer` work with FSDP (e.g. registering a hook for a model wrapped with FSDP) and other PTD/PT components.
2. Make `NamedOptimizer` work well with apply_optim_in_backward.
3. Also upstream `CombinedOptimizer`.

Differential Revision: [D41432088](https://our.internmc.facebook.com/intern/diff/D41432088/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D41432088/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89480
Approved by: https://github.com/rohan-varma
2022-12-06 04:34:19 +00:00
HDCharles
cee396fa07 [ao][ns] PNP demo for exposing arbitrary model transforms (#90153)
Adds a way to use arbitrary prepare and convert functions with PNP.

Note: this is a recreation of https://github.com/pytorch/pytorch/pull/89892, which was reverted because the landing did not sync between GitHub and fbcode.

python test/test_quantization.py TestFxNumericSuiteNShadows.test_custom_functions_and_tracer

Differential Revision: [D41723892](https://our.internmc.facebook.com/intern/diff/D41723892/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90153
Approved by: https://github.com/vkuzo
2022-12-06 04:24:54 +00:00
Sherlock Huang
42705bd7b3 Disallow registering meta function for CompositeImplicitAutograd ops (#90222)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90222
Approved by: https://github.com/ezyang
2022-12-06 04:22:31 +00:00
Natalia Gimelshein
a88400e0cc pad low precision matmuls when requested (#90235)
Matmul padding is beneficial not only for fp32; fp16/bf16 with amp can benefit as well.
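A conceptual sketch of what matmul padding does, not inductor's implementation: pad the inner dimension so the GEMM hits friendlier tile sizes; zero padding along K does not change the result.

```
import torch
import torch.nn.functional as F

def padded_mm(a, b, multiple=8):
    k = a.shape[-1]
    pad = (-k) % multiple
    if pad:
        a = F.pad(a, (0, pad))          # pad K of a: (M, K) -> (M, K + pad)
        b = F.pad(b, (0, 0, 0, pad))    # pad K of b: (K, N) -> (K + pad, N)
    return a @ b

a = torch.randn(64, 125, dtype=torch.bfloat16)
b = torch.randn(125, 64, dtype=torch.bfloat16)
torch.testing.assert_close(padded_mm(a, b), a @ b)
```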

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90235
Approved by: https://github.com/jiawenliu64
2022-12-06 04:13:24 +00:00
Peter Bell
ba70a8be03 as_strided: Fix default storage_offset for reference implementation (#89513)
This fixes the default storage_offset so that it is taken from the input. This was
previously untested, so I've also added a new OpInfo which includes samples with
non-zero storage_offsets on the input tensor.
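An illustrative eager-mode behavior that the reference now matches:

```
import torch

base = torch.arange(10.0)
x = base[2:]                          # x.storage_offset() == 2
y = x.as_strided((2, 2), (2, 1))      # storage_offset not given

# The default offset comes from the input, not 0.
print(x.storage_offset(), y.storage_offset())   # 2 2
```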
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89513
Approved by: https://github.com/ezyang, https://github.com/ngimel
2022-12-06 04:07:16 +00:00
Danni Li
05ccbd6d94 Functionalization: skip meta block computation if compute_reference_meta is false (#90219)
Skip computing meta block when `compute_reference_meta` is `False`.

Issue: #89914

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90219
Approved by: https://github.com/ezyang
2022-12-06 04:03:01 +00:00
Edward Z. Yang
962ebe88a2 Assert there are no outstanding side effects before calling cond (#90208)
The current cond implementation is silently incorrect when
there are outstanding side effects, since the locally tracked
side effects are lost when the recursive export call is made.
At least we raise an assert now.

I'm working on a refactor of cond which should be able to sidestep
this problem. Maybe.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: [D41746973](https://our.internmc.facebook.com/intern/diff/D41746973)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90208
Approved by: https://github.com/voznesenskym
2022-12-06 03:53:48 +00:00
PyTorch MergeBot
0d8e53dfe7 Revert "[Composable API] replicate: change to per module call, remove mark_root_module() (#89222)"
This reverts commit 65a0dcffd8.

Reverted https://github.com/pytorch/pytorch/pull/89222 on behalf of https://github.com/malfet due to Included unintended submodule updates
2022-12-06 03:26:28 +00:00
PyTorch MergeBot
73565ce320 [vision hash update] update the pinned vision hash (#90239)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/master/.github/workflows/_update-commit-hash.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90239
Approved by: https://github.com/pytorchbot
2022-12-06 03:25:17 +00:00
PyTorch MergeBot
3749b9dc73 Revert "[Composable API] replicate: add support for DDP args (#89243)"
This reverts commit 0f274ed385.

Reverted https://github.com/pytorch/pytorch/pull/89243 on behalf of https://github.com/malfet due to Depends on https://github.com/pytorch/pytorch/pull/89222 that introduced spurious module updates
2022-12-06 03:22:18 +00:00
XiaobingSuper
2597d5d722 TorchDynamo: always convert flexiblelayout to be FixedLayout when given a stride_order (#89904)
For convolution, we always call **require_stride_order** to convert the input to the target stride order. If the original input's layout is FlexibleLayout, there is always a memory copy, because **is_stride_order_storage_and_layout** only checks the initial stride order. Since a FlexibleLayout means the layout can still be changed, when the user gives a stride order we should always convert the FlexibleLayout to a FixedLayout using the given stride order.

In a CV use case where the max-pooling output is used by two convolutions, there are two memory copies:

```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0,
                       float* __restrict__ out_ptr1,
                       float* __restrict__ out_ptr2)
{
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<3; i2+=1)
            {
                #pragma GCC ivdep
                for(long i3=0; i3<3; i3+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr0[i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp1 = in_ptr0[3 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp3 = in_ptr0[6 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp5 = in_ptr0[21 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp7 = in_ptr0[24 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp9 = in_ptr0[27 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp11 = in_ptr0[42 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp13 = in_ptr0[45 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp15 = in_ptr0[48 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp2 = (tmp0 != tmp0) ? tmp0 : std::max(tmp1, tmp0);
                            auto tmp4 = (tmp2 != tmp2) ? tmp2 : std::max(tmp3, tmp2);
                            auto tmp6 = (tmp4 != tmp4) ? tmp4 : std::max(tmp5, tmp4);
                            auto tmp8 = (tmp6 != tmp6) ? tmp6 : std::max(tmp7, tmp6);
                            auto tmp10 = (tmp8 != tmp8) ? tmp8 : std::max(tmp9, tmp8);
                            auto tmp12 = (tmp10 != tmp10) ? tmp10 : std::max(tmp11, tmp10);
                            auto tmp14 = (tmp12 != tmp12) ? tmp12 : std::max(tmp13, tmp12);
                            auto tmp16 = (tmp14 != tmp14) ? tmp14 : std::max(tmp15, tmp14);
                            out_ptr0[i3 + (3*i2) + (9*i1) + (27*i0)] = tmp16;
                        }
                    }
                }
            }
        }
    }
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<9; i2+=1)
            {
                {
                    {
                        auto tmp0 = out_ptr0[i1 + (3*i2) + (27*i0)];
                        out_ptr1[i1 + (3*i2) + (27*i0)] = tmp0;
                        out_ptr2[i1 + (3*i2) + (27*i0)] = tmp0;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    buf0 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    buf2 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    buf4 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg4_1.data_ptr()), c_void_p(buf0.data_ptr()), c_void_p(buf2.data_ptr()), c_void_p(buf4.data_ptr()))
    del arg4_1
    del buf0
    buf3 = torch.ops.mkldnn._convolution_pointwise(buf2, arg0_1, arg1_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf3, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg0_1
    del arg1_1
    del buf2
    buf5 = torch.ops.mkldnn._convolution_pointwise(buf4, arg2_1, arg3_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf5, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg2_1
    del arg3_1
    return (buf3, buf5, )
```

After this PR, the generated code no longer contains the redundant memory copy:

```
kernel_cpp_0 = async_compile.cpp('''
#include "/tmp/torchinductor_xiaobing/77/c7773nj5pwikpmm2pwa62rcudlf7p3if7eyqb5k4sjsvewwje4le.h"
extern "C" void kernel(const float* __restrict__ in_ptr0,
                       float* __restrict__ out_ptr0)
{
    #pragma GCC ivdep
    for(long i0=0; i0<128; i0+=1)
    {
        #pragma GCC ivdep
        for(long i1=0; i1<3; i1+=1)
        {
            #pragma GCC ivdep
            for(long i2=0; i2<3; i2+=1)
            {
                #pragma GCC ivdep
                for(long i3=0; i3<3; i3+=1)
                {
                    {
                        {
                            auto tmp0 = in_ptr0[i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp1 = in_ptr0[3 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp3 = in_ptr0[6 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp5 = in_ptr0[21 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp7 = in_ptr0[24 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp9 = in_ptr0[27 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp11 = in_ptr0[42 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp13 = in_ptr0[45 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp15 = in_ptr0[48 + i3 + (6*i2) + (42*i1) + (147*i0)];
                            auto tmp2 = (tmp0 != tmp0) ? tmp0 : std::max(tmp1, tmp0);
                            auto tmp4 = (tmp2 != tmp2) ? tmp2 : std::max(tmp3, tmp2);
                            auto tmp6 = (tmp4 != tmp4) ? tmp4 : std::max(tmp5, tmp4);
                            auto tmp8 = (tmp6 != tmp6) ? tmp6 : std::max(tmp7, tmp6);
                            auto tmp10 = (tmp8 != tmp8) ? tmp8 : std::max(tmp9, tmp8);
                            auto tmp12 = (tmp10 != tmp10) ? tmp10 : std::max(tmp11, tmp10);
                            auto tmp14 = (tmp12 != tmp12) ? tmp12 : std::max(tmp13, tmp12);
                            auto tmp16 = (tmp14 != tmp14) ? tmp14 : std::max(tmp15, tmp14);
                            out_ptr0[i3 + (3*i2) + (9*i1) + (27*i0)] = tmp16;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    buf0 = empty_strided((128, 3, 3, 3), (27, 1, 9, 3), device='cpu', dtype=torch.float32)
    kernel_cpp_0(c_void_p(arg4_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg4_1
    buf2 = torch.ops.mkldnn._convolution_pointwise(buf0, arg0_1, arg1_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf2, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg0_1
    del arg1_1
    buf3 = torch.ops.mkldnn._convolution_pointwise(buf0, arg2_1, arg3_1, (0, 0), (1, 1), (1, 1), 1, 'none', [], '')
    assert_size_stride(buf3, (128, 3, 3, 3), (27, 1, 9, 3))
    del arg2_1
    del arg3_1
    return (buf2, buf3, )

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89904
Approved by: https://github.com/jansel
2022-12-06 03:07:53 +00:00
Bin Bao
29233a18c7 [inductor] Add test_ops_gradients running with inductor (#89792)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89792
Approved by: https://github.com/janeyx99, https://github.com/clee2000, https://github.com/huydhn
2022-12-06 02:26:29 +00:00
William Wen
ebeecbf833 Dynamo FX graph stack traceback fix (#87136)
Migration from https://github.com/pytorch/torchdynamo/pull/1655.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87136
Approved by: https://github.com/voznesenskym
2022-12-06 02:22:16 +00:00
Nikita Shulga
a268b9e53c Fix yet another C++17 Windows build issue (#90228)
Not sure why, but a top-level `using namespace` directive causes VC++ to fail with the following error (when the C++17 standard is used; everything is fine with C++14):
```
C:\actions-runner\_work\pytorch\pytorch\third_party\pybind11\include\pybind11\detail\../pytypes.h(1520): error C2872: 'attr': ambiguous symbol
C:\actions-runner\_work\pytorch\pytorch\aten\src\ATen/core/interned_strings.h(349): note: could be 'c10::attr'
C:\actions-runner\_work\pytorch\pytorch\torch/csrc/jit/ir/ir.h(75): note: or       'torch::jit::attr'
C:\actions-runner\_work\pytorch\pytorch\cmake\..\third_party\pybind11\include\pybind11/pybind11.h(1094): note: see reference to function template instantiation 'pybind11::str pybind11::str::format<_Ty1&>(_Ty1 &) const' being compiled
        with
        [
            _Ty1=pybind11::handle
        ]
```

Solve this by replacing the global `using namespace torch::jit;` with
explicit, fully qualified usages of objects/methods from those namespaces.

Another prep change for https://github.com/pytorch/pytorch/70188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90228
Approved by: https://github.com/kit1980, https://github.com/albanD
2022-12-06 01:35:19 +00:00
Kimish Patel
55b10e6b1d [Pytorch][Vulkan] Use specalized shader for 3x3 depthwise conv (#89953)
This diff uses specialized implementation for 3x3 and 5x5 dw conv.

Differential Revision: [D41006638](https://our.internmc.facebook.com/intern/diff/D41006638/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89953
Approved by: https://github.com/salilsdesai, https://github.com/kirklandsign
2022-12-06 00:56:57 +00:00