Commit Graph

32467 Commits

Author SHA1 Message Date
Edward Z. Yang
f274c7b32c Add functional collective all_to_all_single and support it in Inductor (#110195)
Copy of https://github.com/pytorch/pytorch/pull/106655 from yf225
rebased on top of item() support changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110195
Approved by: https://github.com/Skylion007
2023-10-05 23:11:51 +00:00
Jon Chuang
df7d01aed5 perf(inductor): use for loop with shortcut in Optimizers to speedup against list comprehensions (e.g. complex conversion) (#110613)
Fully fixes: https://github.com/pytorch/pytorch/issues/110506

Depends: https://github.com/pytorch/pytorch/pull/110607
Potential merge conflicts:
- https://github.com/pytorch/pytorch/pull/110339
- https://github.com/pytorch/pytorch/pull/110345
- https://github.com/pytorch/pytorch/pull/110454

Related:
- https://github.com/pytorch/pytorch/issues/110606 (we can apply the improvements here orthogonally to the complex support)

### Results

Benchmark: 100 params.

Breakdowns (float32, dynamo):
```
Adagrad: this PR: 4.4s, main: 8.8s
Adam: this PR: 2.1s, main: 9.8s
AdamW: this PR: 2.5s, main: 8.2s
ASGD: this PR: 3.1s, main: 8.5s
RMSProp: this PR: 1.3s, main: 4.2s
RProp: this PR: 6.7s, main: 14.9s
```

Notes:
1. Adagrad is still slow due to `_get_value` list comprehension. Can be fixed in https://github.com/pytorch/pytorch/pull/110339/files by utilizing capturable path
2. Adamax is not actually compiled (it is currently disabled).
3. Inductor compile time is quite variable. We calculate dynamo by subtracting `call_user_compiler` from `compile_inner` timing.
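
As a rough illustration of the "for loop with shortcut" idea named in the title, the sketch below contrasts rebuilding every list with a comprehension against a single in-place loop that only touches complex entries. It is not the code from this PR: plain Python numbers stand in for tensors, and `view_as_real_pair`, `convert_with_comprehension` and `convert_in_place` are hypothetical names.

```python
def view_as_real_pair(z):
    """Hypothetical stand-in for a complex-to-real view: complex -> (real, imag)."""
    return (z.real, z.imag)


def convert_with_comprehension(params, grads):
    # Rebuilds both lists in full, even when no entry is complex.
    new_params = [view_as_real_pair(p) if isinstance(p, complex) else p for p in params]
    new_grads = [view_as_real_pair(g) if isinstance(g, complex) else g for g in grads]
    return new_params, new_grads


def convert_in_place(params, *state_lists):
    # One loop over the params; the `if` is the shortcut, so real-valued
    # entries are skipped and their slots are left untouched.
    for i, p in enumerate(params):
        if isinstance(p, complex):
            params[i] = view_as_real_pair(p)
            for state in state_lists:
                state[i] = view_as_real_pair(state[i])


params = [1.0, 2 + 3j]
grads = [0.5, 1 - 1j]
convert_in_place(params, grads)
print(params, grads)
```

In the all-real case the loop does no conversion work at all, which is the case the float32 numbers above measure.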

<details>

This PR:
```
Adagrad (torch.float32): 28.47496461868286s
Adagrad (torch.complex64): 29.379547357559204s
Adam (torch.float32): 17.334211587905884s
Adam (torch.complex64): 29.637500524520874s
Adamax (torch.float32): 2.4749321937561035s
Adamax (torch.complex64): 3.1997995376586914s
AdamW (torch.float32): 18.06532859802246s
AdamW (torch.complex64): 28.25661015510559s
ASGD (torch.float32): 23.70255398750305s
ASGD (torch.complex64): 25.33756995201111s
RMSprop (torch.float32): 7.964028596878052s
RMSprop (torch.complex64): 12.909599781036377s
Rprop (torch.float32): 30.512362003326416s
Rprop (torch.complex64): 44.74405765533447s
```

Main
```
Adagrad (torch.float32): 26.919506072998047s
Adagrad (torch.complex64): 35.190622091293335s
Adam (torch.float32): 25.715000867843628s
Adam (torch.complex64): 24.17716670036316s
Adamax (torch.float32): 2.4404726028442383s
Adamax (torch.complex64): 3.3538928031921387s
AdamW (torch.float32): 25.2022807598114s
AdamW (torch.complex64): 28.915700912475586s
ASGD (torch.float32): 24.108731985092163s
ASGD (torch.complex64): 26.589075088500977s
RMSprop (torch.float32): 10.781344175338745s
RMSprop (torch.complex64): 15.136352777481079s
Rprop (torch.float32): 42.46482181549072s
Rprop (torch.complex64): 48.28277635574341s
```

This does not seem to help the complex case much (but that is not the majority case). The torch.float32 results are generally positive; where they show no drastic improvement or regress, manual inspection of the logs attributes it to Inductor variance.

</details>

### Benchmark Script
```python
import torch
import time
from torch.optim import Adagrad, Adam, Adamax, AdamW, ASGD, RMSprop, Rprop

OPTIMS = [Adagrad, Adam, Adamax, AdamW, ASGD, RMSprop, Rprop]
DTYPES = [torch.float, torch.cfloat]

NUM_PARAMS = 100
kwargs = { "lr": 0.01, "foreach": True }
summary = []

for optim_cls in OPTIMS:
    for dtype in DTYPES:
        torch._dynamo.reset()
        # torch._inductor.metrics.reset()
        input = torch.ones([10, 10], dtype=dtype, device="cuda:0")
        model = torch.nn.Sequential(
            *[torch.nn.Linear(10, 10, dtype=dtype, device="cuda:0") for _ in range(NUM_PARAMS)]
        )

        model(input).sum().abs().backward()
        opt_compiled = optim_cls(model.parameters(), **kwargs)
        compiled_step = torch.compile(opt_compiled.step)

        with torch.set_grad_enabled(False):
            start_time = time.time()
            compiled_step()
            summary.append(f"{optim_cls.__name__} ({dtype}): {time.time() - start_time}s")

        print(optim_cls, kwargs, dtype, torch._dynamo.utils.compile_times())

for s in summary:
    print(s)
```

CC: @janeyx99 @mlazos
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110613
Approved by: https://github.com/janeyx99
2023-10-05 23:10:52 +00:00
Jerry Zhang
7b6042111f [quant][pt2e] Refactor conv related annotation for XNNPACKQuantizer (#110308)
Summary:
Since we changed the IR we work with to pre-autograd ATen IR, it is easier
to use plain pattern matching than to rely on source_matcher_utils. This
PR refactors the annotation for conv to use ATen ops directly.

Also fixed reentrant test after this change.

Test Plan:
python test/test_quantization.py TestQuantizePT2E

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110308
Approved by: https://github.com/kimishpatel
2023-10-05 22:36:18 +00:00
albanD
cae537126f Set _diffThreshold on our TestCase (#110603)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110603
Approved by: https://github.com/albanD
2023-10-05 21:49:28 +00:00
Wanchao Liang
c95cf4b4c9 [dtensor] add grad placements kwarg to to_local API (#110629)
When we convert to a local tensor, DTensor can no longer track autograd or the
gradient layout of the local tensor. If the user does something unexpected, there
needs to be a way for the user to hint at the gradient layout of the
local tensor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110629
Approved by: https://github.com/zdevito
2023-10-05 21:34:01 +00:00
chilli
ada65508d2 Add option to flop counter formula registration to get raw values (#110591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110591
Approved by: https://github.com/awgu
ghstack dependencies: #110501, #110504
2023-10-05 21:14:41 +00:00
Scott Wolchok
9e72c9cccd [torch] easy missing move in aoti_runtime/model.h (#110469)
Just an extra shared_ptr copy, nothing fancy.

Differential Revision: [D49792510](https://our.internmc.facebook.com/intern/diff/D49792510/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110469
Approved by: https://github.com/Skylion007
2023-10-05 20:56:06 +00:00
William Wen
71beca4899 [dynamo, logging] Report name of defining class along side function name in Dynamo logs (#110190)
Implement https://github.com/pytorch/pytorch/issues/109236

Sample code:
```python
import torch

class AAA:
    class DUMMY:
        class DUMMY2:
            pass
    def dummy(self):
        def dummy2():
            pass
    class BBB:
        @staticmethod
        def CCC():
            class DDD:
                if True:
                    @staticmethod
                    def EEE():
                        x = [torch.ones(3, 3) for _ in range(5)]
                        return x
            return DDD

def fn():
    return AAA.BBB.CCC().EEE()

opt_fn = torch.compile(fn, backend="eager")

opt_fn()
```

Logs:
```bash
$ TORCH_LOGS="trace_source" python playground2.py
[2023-09-27 17:38:35,641] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line /data/users/williamwen/pytorch/playground2.py:21 in fn (fn)
[2023-09-27 17:38:35,641] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]     def fn():
[2023-09-27 17:38:35,642] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line /data/users/williamwen/pytorch/playground2.py:22 in fn (fn)
[2023-09-27 17:38:35,642] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]         return AAA.BBB.CCC().EEE()
[2023-09-27 17:38:35,661] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line /data/users/williamwen/pytorch/playground2.py:11 in CCC (AAA.BBB) (inline depth: 1)
[2023-09-27 17:38:35,661] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]             @staticmethod
[2023-09-27 17:38:35,661] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line /data/users/williamwen/pytorch/playground2.py:13 in CCC (AAA.BBB.CCC.DDD) (inline depth: 1)
[2023-09-27 17:38:35,661] [0/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]                 class DDD:
[2023-09-27 17:38:35,723] [1/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG] TRACE starts_line /data/users/williamwen/pytorch/playground2.py:17 in <listcomp> (AAA.BBB.CCC.DDD.EEE)
[2023-09-27 17:38:35,723] [1/0] torch._dynamo.symbolic_convert.__trace_source: [DEBUG]                             x = [torch.ones(3, 3) for _ in range(5)]
```
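
The scope names in parentheses in the logs above resemble what Python exposes through `__qualname__`. The sketch below is only an illustration of that standard attribute, not Dynamo's implementation; `defining_scope` is a hypothetical helper.

```python
def defining_scope(fn):
    """Return the dotted scope enclosing `fn`, or "" for a top-level function."""
    # __qualname__ inserts "<locals>" for names defined inside a function body.
    parts = [p for p in fn.__qualname__.split(".") if p != "<locals>"]
    return ".".join(parts[:-1])


class AAA:
    class BBB:
        @staticmethod
        def CCC():
            class DDD:
                @staticmethod
                def EEE():
                    return 1
            return DDD


def fn():
    return AAA.BBB.CCC().EEE()


print(defining_scope(AAA.BBB.CCC))
print(defining_scope(AAA.BBB.CCC().EEE))
print(repr(defining_scope(fn)))
```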

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110190
Approved by: https://github.com/ezyang, https://github.com/mlazos
2023-10-05 20:41:38 +00:00
Jon Chuang
c99de9f37c fix(optim): adagrad sparse multitensor incorrect early exit (#110454)
Fixes https://github.com/pytorch/pytorch/issues/110444#issuecomment-1745181530

This PR:
Passes

Main:
```
test/optim/test_optim.py::TestOptim::test_adagrad_sparse FAILED [0.0058s]

==================================================================================================================================== FAILURES =====================================================================================================================================
__________________________________________________________________________________________________________________________ TestOptim.test_adagrad_sparse __________________________________________________________________________________________________________________________
Traceback (most recent call last):
  File "/home/jonch/Desktop/Programming/mlsys/pytorch/test/optim/test_optim.py", line 1448, in test_adagrad_sparse
    self._test_rosenbrock_sparse(
  File "/home/jonch/Desktop/Programming/mlsys/pytorch/test/optim/test_optim.py", line 128, in _test_rosenbrock_sparse
    self.assertEqual(params, params_c, atol=1e-6, rtol=1e-6)
  File "/home/jonch/Desktop/Programming/mlsys/pytorch/torch/testing/_internal/common_utils.py", line 3309, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 1 / 2 (50.0%)
Greatest absolute difference: 0.09999999999993325 at index (1,) (up to 1e-06 allowed)
Greatest relative difference: 0.06249999999996089 at index (1,) (up to 1e-06 allowed)

```

CC: @janeyx99
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110454
Approved by: https://github.com/janeyx99
2023-10-05 20:37:57 +00:00
CK Luk
ecdd1bcf03 Back out "[Inductor] Break the loop fusion when node2 depends on node1 mutations (#109172)" (#110622)
Summary:
Original commit changeset: 03980fb054d5

Original Phabricator Diff: D49519512

Bisecting shows that this diff is the cause of S369683. Since this affects Ads production, need to back out this diff immediately.

Test Plan: See S369683

Reviewed By: ezyang

Differential Revision: D49958638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110622
Approved by: https://github.com/yanboliang
2023-10-05 20:09:09 +00:00
Chien-Chin Huang
88616349d7 [state_dict][1/N] Implement the basic functions of distributed.checkpoint._state_dict (#105902)
This PR implements the basic functions of distributed.checkpoint._state_dict. This PR currently contains the flattening of optimizer state_dict which makes the PR too large. A later version may split it into 2 for a better code review.

Differential Revision: [D47647719](https://our.internmc.facebook.com/intern/diff/D47647719/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D47647719/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105902
Approved by: https://github.com/wz337
2023-10-05 20:04:15 +00:00
Bin Bao
298f01d9a2 [aotinductor] Avoid generating redundant kernel loading code (#110510)
Summary: 1) Stop forcing triton.unique_kernel_names to True for AOTInductor, because the unique kernel name can be read from metadata; 2) Only generate load_kernel once for each kernel since we don't have control flow in our generated code.  This solves https://github.com/pytorch/pytorch/issues/105553.
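
Point 2 can be pictured with a small code-generation sketch: because the generated code is straight-line, the first emitted load of a kernel precedes every later use, so later calls can skip the load. This is a minimal illustration under that assumption; `emit_kernel_calls` and the emitted `load_kernel(...)` / `launch(...)` strings are hypothetical and do not reproduce AOTInductor's generated code.

```python
def emit_kernel_calls(call_sequence):
    """Return generated lines, emitting a load for each kernel only on first use."""
    lines = []
    loaded = set()  # kernel names whose load line has already been emitted
    for kernel_name in call_sequence:
        if kernel_name not in loaded:
            lines.append(f"{kernel_name} = load_kernel({kernel_name!r});")
            loaded.add(kernel_name)
        lines.append(f"launch({kernel_name});")
    return lines


for line in emit_kernel_calls(["k0", "k1", "k0"]):
    print(line)
```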

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110510
Approved by: https://github.com/chenyang78, https://github.com/jansel
2023-10-05 19:59:38 +00:00
Sherlock Huang
f1b94461aa [AOTInductor] ProxyExecutor support Dynamic Shape (#110526)
Summary:
Extend ProxyExecutor to support dynamic shape.

Example of ProxyExecutor invocation with symints.
```
    int64_t* arg0_1_size;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_sizes(arg0_1, &arg0_1_size));
    auto s0 = arg0_1_size[0];
    auto s1 = arg0_1_size[1];
    int64_t* arg1_1_size;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_sizes(arg1_1, &arg1_1_size));
    auto s2 = arg1_1_size[0];
    auto s3 = arg1_1_size[1];
    ...
    aoti_torch_proxy_executor_call_function(proxy_executor, 0, 15, std::vector<int64_t>{42, 16, 17, s0 + s1, s0 + s1, s2*s3, 45, 67, 16, 17, s2*s3, s2*s3, s0 + s1, 89, 910}.data(), 7, std::vector<AtenTensorHandle>{arg0_1, arg0_1, arg1_1, buf2, arg0_1, arg1_1, buf4}.data());
```

Example of serialized SymInt(s) arguments:
```
          {
            "name": "symint",
            "arg": {
              "asSymInt": {
                "asName": "s0 + s1"
              }
            }
          },
          {
            "name": "symints",
            "arg": {
              "asSymInts": [
                {
                  "asName": "s0 + s1"
                },
                {
                  "asName": "s2*s3"
                }
              ]
            }
          },
          ...
          {
            "name": "o_symint",
            "arg": {
              "asSymInt": {
                "asName": "s2*s3"
              }
            }
          },
          {
            "name": "o_symints",
            "arg": {
              "asSymInts": [
                {
                  "asName": "s2*s3"
                },
                {
                  "asName": "s0 + s1"
                }
              ]
            }
          },
```
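
To make the relationship between the serialized names (such as `"s0 + s1"`) and the size-derived values concrete, here is a small sketch that evaluates such a name against known symbol values. It is an illustration only: how ProxyExecutor actually resolves these names is not shown in this commit message, and `eval_symint` and the example bindings are hypothetical.

```python
import ast
import operator

# Only the two operators that appear in the example names are supported.
_BINOPS = {ast.Add: operator.add, ast.Mult: operator.mul}


def eval_symint(expr, bindings):
    """Evaluate a name like "s0 + s1" using integer values for each symbol."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Name):
            return bindings[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINOPS:
            return _BINOPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expr!r}")

    return walk(ast.parse(expr, mode="eval"))


# Example values, as if read from the input tensors' sizes.
bindings = {"s0": 2, "s1": 3, "s2": 4, "s3": 5}
print(eval_symint("s0 + s1", bindings), eval_symint("s2*s3", bindings))
```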

Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops

Differential Revision: D49887555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110526
Approved by: https://github.com/chenyang78
2023-10-05 19:05:20 +00:00
Dmytro Dzhulgakov
a0cea517e7 Add 9.0a to cpp_extension supported compute archs (#110587)
There's an extended compute capability 9.0a for Hopper that was introduced in CUDA 12.0: https://docs.nvidia.com/cuda/archive/12.0.0/cuda-compiler-driver-nvcc/index.html#gpu-feature-list

E.g. Cutlass leverages it: 5f13dcad78/python/cutlass/emit/pytorch.py (L684)

This adds it to the list of permitted architectures to use in `cpp_extension` directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110587
Approved by: https://github.com/ezyang
2023-10-05 17:41:06 +00:00
Antoni Viros i Martin
efdf155383 Add requirement for input to AllGatherIntoTensor to be contiguous (#109561)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109561
Approved by: https://github.com/Chillee
2023-10-05 17:04:48 +00:00
Catherine Lee
d6e5898e8d Quieter logs in CI (#110033)
To reduce the amount of logs
* for successes, only print the part that says which tests ran and don't print the rest.  Zip the log into an artifact.  The line listing all the test names is really long, but if you view the source of the raw logs, it will not wrap, so it will only be one line.  The log classifier can also be configured to ignore this line. Gets rid of lines like `test_ops.py::TestCommonCPU::test_multiple_devices_round_cpu_int64 SKIPPED [0.0010s] (Only runs on cuda) [  9%]`
* for failures/reruns, print logs.  Do not zip.

Also
* change log artifact name

Examples of various logs:
a074db0f7f failures
1b439e24c4 failures

possibly controversial haha
should i include an option for always printing?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110033
Approved by: https://github.com/huydhn
2023-10-05 16:40:37 +00:00
ydwu4
cc1de49340 [HigherOrderOp] fallthrough some keys by default. (#110478)
Fixes #109253

Test Plan:
Added a new test that shows default fallthrough keys can be overridden.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110478
Approved by: https://github.com/ezyang
2023-10-05 16:25:42 +00:00
Jason Park
26f634eefb Enable aarch64 for fixing undefined symbol error. (#110542)
Summary: ARM can be safely supported

Reviewed By: andrewjcg, aaronenyeshi

Differential Revision: D49921679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110542
Approved by: https://github.com/aaronenyeshi
2023-10-05 16:16:06 +00:00
chilli
f767a6c57a Made pattern-matcher diagnostics lazily reported + added TORCH_COMPILE_CPROFILE (#110504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110504
Approved by: https://github.com/mlazos, https://github.com/eellison
ghstack dependencies: #110501
2023-10-05 15:47:30 +00:00
PyTorch MergeBot
1e4c0641ce Revert "Made pattern-matcher diagnostics lazily reported + added TORCH_COMPILE_CPROFILE (#110504)"
This reverts commit 9648df1a6a.

Reverted https://github.com/pytorch/pytorch/pull/110504 on behalf of https://github.com/PaliC due to temporarily will revert as it's causing problems with difftrain import ([comment](https://github.com/pytorch/pytorch/pull/110504#issuecomment-1749132253))
2023-10-05 15:28:23 +00:00
Chien-Chin Huang
1a729618ef [FSDP][optim_state_dict] Make the new optimizer allgather fusion work with fine-tuning models (#110540)
With use_orig_params=True, it is possible that some parameters of the same FlatParameter are in the optimizer while other parameters are frozen. This PR makes the allgather fusion logic support this case.

Differential Revision: [D49922028](https://our.internmc.facebook.com/intern/diff/D49922028/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110540
Approved by: https://github.com/awgu, https://github.com/rohan-varma
2023-10-05 15:17:10 +00:00
Joel Schlosser
f17fe89e14 Multiprocessing support for NT (#110292)
Fixes #110161

Allows NTs to be used in DataLoaders with `num_workers > 1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110292
Approved by: https://github.com/cpuhrsch, https://github.com/albanD
2023-10-05 15:04:48 +00:00
Andrew Or
7c72238e4b Back out "Enable pickling model prepared with QAT qconfig" (#110392)
Summary:
D49187352 caused our model conversion and QAT checkpoint loading to get stuck with a Thrift timeout.

We are actively checking in the final code and model for the static quant HTP prod model, and encountered this breakage at head on Thursday.

A Thrift timeout is not an outright failure, which makes it hard to bisect and find the culprit. It is also hard to set up a unit test, because the job simply times out. Better tests are needed to guard downstream model conversion against upstream changes.

Our suspicion of why this diff broke us is that we create a lot of modules with QAT (recursively), but our model is not a QAT-traceable module (it is a graph with many QAT modules and floating-point modules). With functools.partial as in the original diff, we would be caching modules in memory, eventually using up all of the machine's memory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110392
Approved by: https://github.com/junesg, https://github.com/jerryzh168
2023-10-05 14:41:00 +00:00
Oleg Khabinov
cf1b494afd [AOTInductor] Store loaded kernels in the model (#110554)
Defining kernels as static vars is problematic for subsequent model loading on non-default CUDA devices.

Assuming those kernels were loaded in the context of device #0, they are no longer nullptr, so they are not loaded again and will not work on devices other than device #0.

This change makes devices remembered at model level in AOT mode.
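
The failure mode can be mimicked in plain Python, with a class attribute playing the role of a static variable. This is only an analogy for the description above, not the AOTInductor runtime; both classes and the `"kernel@cuda:N"` strings are hypothetical.

```python
class StaticKernelModel:
    # Shared by every instance, like a static variable in generated code.
    _kernel = None

    def __init__(self, device):
        self.device = device

    def run(self):
        # Loaded once, in the context of whichever device ran first.
        if StaticKernelModel._kernel is None:
            StaticKernelModel._kernel = f"kernel@cuda:{self.device}"
        return StaticKernelModel._kernel


class PerModelKernelModel:
    def __init__(self, device):
        self.device = device
        self._kernel = None  # stored on the model instance instead

    def run(self):
        if self._kernel is None:
            self._kernel = f"kernel@cuda:{self.device}"
        return self._kernel


print(StaticKernelModel(0).run(), StaticKernelModel(1).run())
print(PerModelKernelModel(0).run(), PerModelKernelModel(1).run())
```

With the shared attribute, the second model reuses the handle created for device 0; with per-model storage, each model loads its own.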

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110554
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2023-10-05 10:17:05 +00:00
Sehoon Kim
c36b31d530 torch::nn::AdaptiveLogSoftmaxWithLoss: check length of cutoffs (#106777)
Fixes #106698

Also added a check for python API, because current error message
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sehoon/pytorch-latest/torch/nn/modules/adaptive.py", line 128, in __init__
    or (min(cutoffs) <= 0) \
ValueError: min() arg is an empty sequence
```
is not very comprehensible.
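
A sketch of the kind of up-front length check described here, so that an empty `cutoffs` produces a readable message before `min()` is ever called. The function name `validate_cutoffs`, the exact conditions and the error wording are assumptions for illustration, not the code merged in this PR.

```python
def validate_cutoffs(cutoffs, n_classes):
    """Return cutoffs as a list, raising a descriptive error for bad input."""
    cutoffs = list(cutoffs)
    # Check emptiness first so min()/max() below never see an empty sequence.
    if len(cutoffs) == 0:
        raise ValueError("cutoffs should be a sequence of length larger than 0")
    if (
        cutoffs != sorted(cutoffs)
        or min(cutoffs) <= 0
        or max(cutoffs) > n_classes - 1
        or len(set(cutoffs)) != len(cutoffs)
        or any(int(c) != c for c in cutoffs)
    ):
        raise ValueError(
            "cutoffs should be a sequence of unique, positive integers "
            "sorted in increasing order, each between 1 and n_classes - 1"
        )
    return cutoffs


print(validate_cutoffs([10, 100], n_classes=1000))
try:
    validate_cutoffs([], n_classes=1000)
except ValueError as err:
    print(err)
```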

Pull Request resolved: https://github.com/pytorch/pytorch/pull/106777
Approved by: https://github.com/albanD
2023-10-05 05:35:47 +00:00
Avik Chaudhuri
416eca9736 export db links for user errors (#110555)
Ideally all `_dynamo.exc.UserError`s should have "case names", i.e., link to examples in `exportdb`.

This PR adds case names to several instances of `_dynamo.exc.UserError`. In particular, looking at coverage based on `UserErrorType`:
* `DYNAMIC_CONTROL_FLOW`, `ANTI_PATTERN`, and `STANDARD_LIBRARY` are fully covered.
* `CONSTRAINT_VIOLATION` and `DYNAMIC_DIM` have no coverage. We don't seem to have any dedicated examples of specifying dynamic shapes in `exportdb` (although they are used in some other examples without explanation, to avoid some specialization that would make such examples moot).
* `INVALID_INPUT` is only partly covered. Frankly this is tedious to cover via examples.

Differential Revision: [D49928518](https://our.internmc.facebook.com/intern/diff/D49928518/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110555
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2023-10-05 05:03:04 +00:00
PyTorch MergeBot
21019620ee Revert "[Dynamo] SizeVariable can be indexed by symint (#110349)"
This reverts commit 510ec7e3c5.

Reverted https://github.com/pytorch/pytorch/pull/110349 on behalf of https://github.com/PaliC due to breaking internal tests (check diff) ([comment](https://github.com/pytorch/pytorch/pull/110349#issuecomment-1748021641))
2023-10-05 04:42:33 +00:00
andrewor14
62cad5b5b0 [quant][pt2] Support cudnn_batch_norm in QAT fusion (#109908)
Summary: Today, we get different batch norm ops depending on
the device the model is placed on at export time. Exporting
`model.cpu()` gives `_native_batch_norm_legit`, while exporting
`model.cuda()` gives `cudnn_batch_norm`. QAT fusion currently
only supports the former and silently ignores the latter. This
commit fixes this by additionally matching on the latter op
during QAT fusion.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_fusion
python test/test_quantization.py TestQuantizePT2EQAT.test_qat_conv_bn_relu_fusion

Reviewers: jerryzh168, kimishpatel

Subscribers: jerryzh168, kimishpatel, supriyar

Differential Revision: [D49615145](https://our.internmc.facebook.com/intern/diff/D49615145)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109908
Approved by: https://github.com/jerryzh168
2023-10-05 04:08:44 +00:00
lezcano
4b1e138162 [dynamo] [easy]Remove InstructionTranslator from within Set (#110521)
I believe this was a left over from the before times. See if CI agrees.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110521
Approved by: https://github.com/ezyang
2023-10-05 04:01:18 +00:00
Kazuaki Ishizaki
434a996c42 Fix typo under torch/_inductor directory (#110530)
This PR fixes typos in comments and messages in files under the `torch/_inductor` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110530
Approved by: https://github.com/kit1980
2023-10-05 02:17:20 +00:00
chilli
9648df1a6a Made pattern-matcher diagnostics lazily reported + added TORCH_COMPILE_CPROFILE (#110504)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110504
Approved by: https://github.com/mlazos, https://github.com/eellison
ghstack dependencies: #110501
2023-10-05 01:34:57 +00:00
chilli
e686341f64 Consider that ops can be fused into cat in the min-cut partitioner (#110501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110501
Approved by: https://github.com/eellison
2023-10-05 01:34:57 +00:00
Justin Chu
d24e7be243 Include onnx and onnxscript information in collect_env.py (#110560)
`onnx` and `onnxscript` are used in torch.onnx.dynamo_export since 2.0. It would be helpful to collect version information in user issue reports.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110560
Approved by: https://github.com/albanD
2023-10-05 01:29:04 +00:00
Amadeusz Skrzypczak
653f966df0 Fix type promotion of float8_e5m2 and float8_e4m3fn (#110279)
There is an issue with float8 type promotion, because _promoteTypesLookup doesn't contain records for a few types between bfloat16 and float8.
I have simply moved the float8 types to just after bfloat16; however, I'm not sure whether this breaks serialization.

Please decide whether it can stay like this, or whether I should instead insert the missing records, filled with "ud", into _promoteTypesLookup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110279
Approved by: https://github.com/albanD
2023-10-05 01:28:48 +00:00
Edward Z. Yang
6a974bec5d Change flash attention outputs to be SymInt instead of int (#110533)
Fixes https://github.com/pytorch/pytorch/issues/110322

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110533
Approved by: https://github.com/albanD
2023-10-05 01:00:07 +00:00
Edward Z. Yang
f1d81134ef Print output type if assert fires (#110534)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110534
Approved by: https://github.com/albanD
2023-10-05 00:59:17 +00:00
Mihir Patel
95c59b30b8 Update fully_sharded_data_parallel to fix typing (#110545)
Fixes typing so that linter does not complain when using CustomPolicy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110545
Approved by: https://github.com/awgu, https://github.com/Skylion007
2023-10-05 00:00:10 +00:00
Xuehai Pan
0daa7d4815 [test][docs] Fix doctest warnings for syntax errors (#110517)
Fixes some syntax errors in doctests found in CI tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110517
Approved by: https://github.com/albanD
2023-10-05 00:00:06 +00:00
Fabrice Pont
053367b1ed fix: flake8-bugbear code B024 (#107265)
See #106571 item B024

This fix concerns the addition of `abstractmethod` to methods declared inside abstract classes.
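
For context, B024 flags abstract base classes that declare no abstract methods. The generic example below (not taken from the PyTorch files touched here; all class names are hypothetical) shows the practical difference: without `abstractmethod`, the base class can still be instantiated.

```python
from abc import ABC, abstractmethod


class LooseBase(ABC):
    # No abstract methods: B024 would flag this class.
    def run(self):
        raise NotImplementedError


class StrictBase(ABC):
    @abstractmethod
    def run(self):
        ...


class Impl(StrictBase):
    def run(self):
        return "ok"


LooseBase()  # instantiation is allowed despite inheriting from ABC
try:
    StrictBase()
except TypeError as err:
    print("cannot instantiate:", err)
print(Impl().run())
```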

Should I also include PEP8-compliant reformatting of the files I had to modify?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107265
Approved by: https://github.com/kit1980
2023-10-04 23:52:52 +00:00
Xuehai Pan
449271f3f1 [pytree] Extract reusable generic tests for pytree (#110395)
Part of #109684

- #109684

Changes:

- Add new functions `tree_structure`, `tree_leaves`, `tree_map_` and `tree_map_only_` to Python pytree.
- Extract reusable tests for pytree to `TestGenericPytree`.
- Change `treespec_dumps` and `treespec_loads` in C++ pytree to call Python pytree and use JSON string as serialization type.
- Rename `torch.utils.pytree` -> `torch.utils._cxx_pytree`.
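
For readers unfamiliar with the pytree vocabulary, the following toy sketch shows what "leaves" and "map" mean for nested containers. It is a simplified pure-Python illustration, not the implementation in PyTorch's pytree modules, and it ignores details such as how `None`, namedtuples or custom registered types are treated.

```python
def tree_leaves(tree):
    """Flatten nested lists, tuples and dicts into a list of leaves."""
    if isinstance(tree, (list, tuple)):
        return [leaf for child in tree for leaf in tree_leaves(child)]
    if isinstance(tree, dict):
        return [leaf for child in tree.values() for leaf in tree_leaves(child)]
    return [tree]


def tree_map(fn, tree):
    """Apply fn to every leaf while keeping the container structure."""
    if isinstance(tree, (list, tuple)):
        return type(tree)(tree_map(fn, child) for child in tree)
    if isinstance(tree, dict):
        return {key: tree_map(fn, child) for key, child in tree.items()}
    return fn(tree)


tree = {"a": [1, 2], "b": (3, {"c": 4})}
print(tree_leaves(tree))
print(tree_map(lambda x: x * 10, tree))
```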

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110395
Approved by: https://github.com/zou3519
2023-10-04 23:40:50 +00:00
Jon Chuang
37afa0c349 fix(inductor): Increase coverage of Inductor ATen lowering (#110473)
Add sqrt to decomp testing path and fix missing `minimum`, `clamp_min`,`clamp_max` lowerings and/or registrations.

Follow up to: https://github.com/pytorch/pytorch/pull/110468#issuecomment-1745718602 (requires upstream to merge to avoid merge conflict)

CC: @janeyx99

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110473
Approved by: https://github.com/janeyx99
2023-10-04 23:40:46 +00:00
Xu Zhao
2e31fae5c5 Cleanup the code in the dynamo userbenchmark (#110519)
Summary:
Skip importing the modules that are only available in the pytorch source code, not pytorch nightly release.

Make dynamo benchmark work on both OSS and internal.

X-link: https://github.com/pytorch/benchmark/pull/1960

Test Plan:
```
$ python run_benchmark.py dynamo --only alexnet --training --performance --inductor
loading model: 0it [00:05, ?it/s]
cuda train alexnet
running benchmark: 100%|█████████████████| 30/30 [00:00<00:00, 41.46it/s]
1.129x
```

```
$ buck2 run mode/opt //pytorch/benchmark:run_benchmark -- dynamo --only alexnet --training --inductor --performance --output-directory $HOME
loading model: 0it [00:16, ?it/s]
running benchmark: 100%|█████████████████| 30/30 [00:00<00:00, 37.94it/s]
cuda train alexnet
1.120x
```

Differential Revision: D49912006

Pulled By: xuzhao9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110519
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-10-04 23:26:30 +00:00
Howard Huang
0949d97c16 fix batch_isend_irecv example incorrect usage (#110408)
Mismatched dtypes silently lead to wrong outputs in NCCL:

```
1:recv_tensor=tensor([0., 0.], device='cuda:1')
0:recv_tensor=tensor([2.8026e-45, 0.0000e+00], device='cuda:0')
```
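
The value `2.8026e-45` is what the integer 2 looks like when its bytes are read back as a 32-bit float. The sketch below reproduces that with the standard `struct` module, assuming (this is not stated in the commit message) that the sender used a 64-bit integer tensor and the receiver a float32 tensor on a little-endian machine.

```python
import struct

# Bytes of the 64-bit integer 2, little-endian (assumed sender dtype).
raw = struct.pack("<q", 2)

# Reinterpret the same 8 bytes as two 32-bit floats (assumed receiver dtype).
as_floats = struct.unpack("<2f", raw)
print(as_floats)
```

No conversion happens on the wire; the receiver simply interprets the bytes according to its own dtype, which is why nothing raises an error.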

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110408
Approved by: https://github.com/awgu, https://github.com/Neilblaze
2023-10-04 22:57:03 +00:00
soulitzer
8672d64fed Use is_symbolic instead of testing isinstance in some place (#110372)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110372
Approved by: https://github.com/ezyang
ghstack dependencies: #110044, #110369, #110370, #110371
2023-10-04 22:56:42 +00:00
soulitzer
e1cfcdfa06 Symintify guards.cpp (#110371)
Separating this out so we can check perf more easily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110371
Approved by: https://github.com/ezyang
ghstack dependencies: #110044, #110369, #110370
2023-10-04 22:56:42 +00:00
soulitzer
a7145cb3a4 Add symbolic singleton int (#110370)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110370
Approved by: https://github.com/ezyang
ghstack dependencies: #110044, #110369
2023-10-04 22:56:26 +00:00
soulitzer
eb8feb8ff8 Support SingletonSymNode mul with coefficient (#110369)
We want to be able to use SingletonSymNode to represent strides for jagged-layout tensors. The following is for 3D, but is easily generalizable to higher dimensions.

Constraints:
- [B, x, D] (where x represents the variable-length dim) can be strided in two ways: [x, 1, sum(x)] and [dx, d, 1]. We need two different placeholder values depending on how the jagged tensor is strided.
- When doing operations, we need the strides of output tensors to be expressible in terms of the strides and sizes of the inner tensors. Given [B, x, D] @ [D, D'], the output strides are [x * D', D', 1] rather than some opaque [x2, D', 1]. This constraint exists because, when tracing, I need a symint to represent the output stride. This symint needs to come from somewhere; I can get it in several ways: (1) create a constant, (2) create an unbacked symint, (3) create a new input using a source, or (4) take the output of an operation on an existing symint. It is clear that (4) is what we want here, which brings us to the design below.

Design:

Given the two constraints, the most straightforward way to implement this is to update SingletonSymNode to include a scalar factor, i.e., morally, SingletonSymNode represents `factor * [s_0, s_1, …, s_n]`. This enables us to symbolically compute strides from sizes.
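
The design can be sketched as a small value type that carries an integer coefficient, so that multiplying the placeholder by a size yields another placeholder rather than an opaque new symbol. This is a toy model of the idea only; `SingletonInt`, `ident` and `coeff` are hypothetical names, not the C++ SingletonSymNode.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SingletonInt:
    ident: int      # identifies which jagged dimension this stands for
    coeff: int = 1  # scalar factor applied to that dimension

    def __mul__(self, other):
        if not isinstance(other, int):
            return NotImplemented
        # Scaling only changes the coefficient; the identity is preserved.
        return SingletonInt(self.ident, self.coeff * other)

    __rmul__ = __mul__


x = SingletonInt(ident=0)
d_out = 8
# Strides of [B, x, D] @ [D, D'] expressed through x, as in the constraint above.
out_strides = [x * d_out, d_out, 1]
print(out_strides)
```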
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110369
Approved by: https://github.com/ezyang
ghstack dependencies: #110044
2023-10-04 22:56:15 +00:00
soulitzer
4e73eee93f Update custom Function preserve torch function when inputs returned as-is (#109825)
Fixes https://github.com/pytorch/pytorch/issues/109805
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109825
Approved by: https://github.com/albanD
2023-10-04 22:45:11 +00:00
Avik Chaudhuri
6fc09aee36 constant output errors (#110472)
When mapping between the original signature of a program and the graph-captured signature of its exported program, we emit errors when we see unexpected original or graph-captured inputs or outputs.

These errors can arise because of various reasons, e.g.:
1. some input or output has been lifted because of mutation
2. some type is not pytree-registered for flattening / unflattening
3. some type cannot be realized with graph operations

(This is probably not an exhaustive list.)

Previously we used to emit errors based on a vanilla id-based membership check between the two sides, mostly anticipating (1) as the reason for errors. But this does not do justice to errors because of (2) or (3).

This PR emits a different error when it finds (3) to be a probable cause. Specifically, it considers only Tensor and Sym* types to be "supported": no other type seems to be realizable by graph operations.

When (2) is a probable cause, we sometimes also hit the same error because we would expect the supported types to show through upon registration. But this kind of error may need some more work in the future.

Differential Revision: [D49885828](https://our.internmc.facebook.com/intern/diff/D49885828/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110472
Approved by: https://github.com/ydwu4
2023-10-04 21:56:20 +00:00
Bert Maher
a9df9e5187 [inductor] get_system shouldn't error if CUDA is not installed (#110282)
Using inductor on a CPU-only device should be OK.

Differential Revision: [D49749912](https://our.internmc.facebook.com/intern/diff/D49749912/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110282
Approved by: https://github.com/desertfire
2023-10-04 21:28:55 +00:00
ydwu4
6db3853eeb Add doc for torch.cond (#108691)
We add a doc for torch.cond. This PR is a replacement of https://github.com/pytorch/pytorch/pull/107977.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108691
Approved by: https://github.com/zou3519
2023-10-04 21:24:14 +00:00
Yang Chen
46a5558cd5 [AOTInductor] Simplified AOTInductor interface and model class (#110411)
Summary:
This PR removed several APIs from the AOTInductor interface,
which are not used by the client.

It also simplified AOTInductor's model class by removing
the dim info for input/output tensors. We included dim info
before to return max output shapes, which was used by the client
to allocate memory for output tensors. Now, we allocate output
tensor memory from the .so so that we don't need to maintain
such information any more. The deletion of dim info from
the model class also simplified the codegen quite a bit.

Test Plan: ci

Reviewed By: khabinov

Differential Revision: D49835430

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110411
Approved by: https://github.com/khabinov, https://github.com/desertfire, https://github.com/jansel
2023-10-04 18:35:24 +00:00
Oguz Ulgen
f04b1a0d27 [AOTInductor] Implement autograd eager backend for native triton kernels (#110403)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110403
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2023-10-04 17:56:56 +00:00
Bin Bao
c0c2e052a4 [aotinductor] Clean up fallback kernel cpp name generation (#110267)
Summary: Unify the way to generate cpp kernel name when the kernel is from OpOverload

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110267
Approved by: https://github.com/zou3519
ghstack dependencies: #110233
2023-10-04 17:18:02 +00:00
Bin Bao
539367f0bc [aotinductor] Refactor optional value codegen (#110233)
Summary: Simplify the codegen for optional values by using c10::nullopt, and we don't need placeholders like OptionalScalar because we can simply use None for that purpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110233
Approved by: https://github.com/jansel
2023-10-04 17:18:02 +00:00
Shiyan Deng
247c574313 [jit] make register parameter/buffer thread safe in torch::jit::Module (#110488)
Summary: Registering a param/buffer writes into a vector inside Object, so we need to maintain thread safety when some threads read from the vector while others write to it.

Test Plan: CI

Differential Revision: D49882601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110488
Approved by: https://github.com/davidberard98
2023-10-04 17:04:23 +00:00
Kazuaki Ishizaki
2c1b009e39 Fix typo under torch/_dynamo directory (#110459)
This PR fixes typo of comments in files under `torch/_dynamo` directory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110459
Approved by: https://github.com/colesbury
2023-10-04 16:05:05 +00:00
Bert Maher
4c3d3b7176 [inductor] Lower small gemvs on CPU (#110456)
If the gemv fits in registers, like [1,16]*[16,16], MKL isn't going to
do much better than compiling a simple for-loop, and we end up paying
allocation overhead and ATen overhead.
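
For illustration only, the kind of "simple for-loop" meant here can be sketched as below. This is plain Python rather than the C++ that Inductor generates, and `small_gemv` is a hypothetical name.

```python
def small_gemv(x, w):
    """Compute x @ w for a length-K vector x and a K x N matrix w (nested lists)."""
    k, n = len(w), len(w[0])
    out = [0.0] * n
    for j in range(n):
        acc = 0.0
        for i in range(k):
            acc += x[i] * w[i][j]
        out[j] = acc
    return out


result = small_gemv([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])
```

With shapes as small as [1,16]*[16,16], a loop nest like this needs no library call and no temporary allocation.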

A very small internal inference model drops from 7->5 us with this change.

Differential Revision: [D49875991](https://our.internmc.facebook.com/intern/diff/D49875991/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110456
Approved by: https://github.com/chenyang78, https://github.com/jgong5
2023-10-04 15:16:38 +00:00
Banit Agrawal
30c4c6ff9b [PyTorch CCA] Refactor caching allocator config code (#110123)
Summary: This diff refactors the code by moving CUDAAllocatorConfig into the header file. This config refactoring is done so that we can use the same config code for CUDA pinned memory as well.

Test Plan: sandcastle

Differential Revision: D49653265

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110123
Approved by: https://github.com/zdevito
2023-10-04 14:58:23 +00:00
PyTorch MergeBot
156aefa89b Revert "[3/N] Add -Wdeprecated and related fixes (#109698)"
This reverts commit c31fcdaa4f.

Reverted https://github.com/pytorch/pytorch/pull/109698 on behalf of https://github.com/PaliC due to breaking quantization tests ( quantization/test_quantize_per_channel_sub_byte and  quantization/test_quantize_per_channel_float_qparams) internally ([comment](https://github.com/pytorch/pytorch/pull/109698#issuecomment-1746999806))
2023-10-04 14:33:47 +00:00
Yukio Siraichi
0e55cc4986 [HigherOrderOp] Flatten outputs of wrap. (#109433)
Fix: #109247

This PR flattens `wrap` outputs by inlining `pytree.tree_flatten` function after calling
the inner function.
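
A rough sketch of what flattening the outputs means, assuming a simplified `tree_flatten` written here for illustration (it is not the real `pytree.tree_flatten`, and `flattened` is a hypothetical helper):

```python
def tree_flatten(tree):
    """Return (leaves, spec) for nested lists, tuples, and dicts."""
    if isinstance(tree, (list, tuple)):
        leaves, specs = [], []
        for child in tree:
            child_leaves, child_spec = tree_flatten(child)
            leaves.extend(child_leaves)
            specs.append(child_spec)
        return leaves, (type(tree), specs)
    if isinstance(tree, dict):
        leaves, specs = [], []
        for key, child in tree.items():
            child_leaves, child_spec = tree_flatten(child)
            leaves.extend(child_leaves)
            specs.append((key, child_spec))
        return leaves, (dict, specs)
    return [tree], None  # anything else is a leaf


def flattened(fn):
    """Call fn, then return only the flat tuple of output leaves."""
    def inner(*args):
        leaves, _spec = tree_flatten(fn(*args))
        return tuple(leaves)
    return inner


def nested(a, b):
    return {"sum": a + b, "parts": (a, [b, a * b])}


flat = flattened(nested)(2, 3)
```

The wrapped function returns a flat tuple of leaves, while the discarded spec is what a caller would keep in order to rebuild the original nesting.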

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109433
Approved by: https://github.com/zou3519
ghstack dependencies: #110290
2023-10-04 13:43:55 +00:00
Raphael Reme
9f0601df6d Fix a typo in cholesky_inverse documentation (#110364)
Very small PR to fix a typo in the [cholesky_inverse](https://pytorch.org/docs/stable/generated/torch.cholesky_inverse.html) doc.

According to the current doc, the function expects $A$, the symmetric positive-definite matrix, as input. But the examples given (and, more importantly, the code) use $u$, the Cholesky decomposition of this matrix (like cholesky_solve).

Also, it provides a correct example of batch usage of this function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110364
Approved by: https://github.com/lezcano
2023-10-04 12:30:11 +00:00
Ken Jin
31d635803b [Dynamo] Fx proxy for builtin all with list iterators (#109972)
Fixes https://github.com/pytorch/pytorch/issues/109057.
Fixes https://github.com/pytorch/pytorch/issues/103620.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109972
Approved by: https://github.com/ezyang
2023-10-04 07:59:26 +00:00
Yu Guo
2bf3ca1be7 [torchdynamo] preserve deterministic_algorithms_warn_only in convert_context (#110457)
Summary: preserve deterministic_algorithms_warn_only in dynamo context

Test Plan: modified unit tests to test warn_only

Differential Revision: D49872622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110457
Approved by: https://github.com/jansel
2023-10-04 07:12:32 +00:00
Jez Ng
dddf581da7 [dynamo] Add graph break on requires_grad_() (#110053)
Fixes #107861.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110053
Approved by: https://github.com/eellison
2023-10-04 06:22:16 +00:00
Xiaodong Wang
562c68e56f [nccl] denoise warning msg (#110433)
Summary: This is too noisy for anything set with TORCH_NCCL_USE_COMM_NONBLOCKING. Just warn once.

Test Plan: GH CI

Differential Revision: D49846339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110433
Approved by: https://github.com/awgu
2023-10-04 06:21:53 +00:00
Jon Chuang
3fd938369f add foreach_abs meta registration and inductor decomp (#110468)
Fixes https://github.com/pytorch/pytorch/issues/110458

Somehow it is on allowlist but not on testing path.

CC @janeyx99

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110468
Approved by: https://github.com/janeyx99
2023-10-04 06:09:37 +00:00
Max Ren
08c7dcda65 [pt2e][xnnpack_quantizer] quantize "mul" (#110428)
Adding "mul" to list of partitions that are supported by the quantizer. This shows up in EDSR, where we still want to quantize the mul op

Differential Revision: [D49850151](https://our.internmc.facebook.com/intern/diff/D49850151/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110428
Approved by: https://github.com/jerryzh168
ghstack dependencies: #110427
2023-10-04 05:11:53 +00:00
Max Ren
66202ed29c [pt2e][xnnpack_quantizer] add util function to convert scalars to attrs (#110427)
Jerry provided a notebook solution for converting scalars to attrs so that they may be properly quantized:

https://fburl.com/anp/kzz7tfn1

Adding this pass as a util function in xnnpack_quantizer_utils.py

Differential Revision: [D49850150](https://our.internmc.facebook.com/intern/diff/D49850150/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110427
Approved by: https://github.com/jerryzh168
2023-10-04 05:11:53 +00:00
chilli
005e8ddcb9 cache the hash construction on Guard (#110464)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110464
Approved by: https://github.com/zou3519, https://github.com/voznesenskym
2023-10-04 04:49:18 +00:00
zdevito
3fe3439242 Use LLVMSymbolizer directly for unwind inside fbcode (#108800)
Using LLVMSymbolizer directly avoids having to call fork which has caused timeouts in some circumstances.

Differential Revision: [D49070589](https://our.internmc.facebook.com/intern/diff/D49070589/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108800
Approved by: https://github.com/aaronenyeshi
2023-10-04 04:04:08 +00:00
Yanbo Liang
510ec7e3c5 [Dynamo] SizeVariable can be indexed by symint (#110349)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110349
Approved by: https://github.com/williamwen42
2023-10-04 03:20:18 +00:00
Sherlock Huang
50054b1a62 [AOTInductor] ProxyExecutor support ReinterpretView inputs (#110451)
Summary:
See wrapper.codegen_reinterpret_view(): it returns a temporary handle for a tensor, which has the following problem.
```
            # NB, the return handle here represents a temporary tensor, which will be automatically
            # released.
            # Here's a sample usage in the cpp wrapper code:
            # ```
            # aoti_torch_addmm_out(
            #     buf1,
            #     arg1_1,
            #     RAIIAtenTensorHandle(tmp_tensor_handle_0),
            #     buf0,
            #     1L,
            #     1L));
            # ```
            # RAIIAtenTensorHandle(tmp_tensor_handle_0) will be released after the call to addmm_out.
            # This could be problematic when it's used in a different pattern, for example:
            # ````
            # AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
            # aoti_torch_proxy_executor_call_function(..., tensor_args);
            # ````
            # RAIIAtenTensorHandle(tmp_tensor_handle_2) will be invalid when it's used in the latter
            # kernel call.
            return f"RAIIAtenTensorHandle({tmp_name})"
```

As a result, ProxyExecutor would generate the following code, which causes an invalid memory access.

Before:

```
    // Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
    AtenTensorHandle tmp_tensor_handle_2;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
    ...
    AtenTensorHandle tensor_args[] = {RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6};
    int64_t int_args[] = {1};
    aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, int_args, 3, tensor_args);
    buf3.reset();
```

With the fix in this diff, ProxyExecutor generates the following code.

After:

```
    // Source Nodes: [fn_with_tuple_output], Original ATen: [fb.fn_with_tuple_output]
    AtenTensorHandle tmp_tensor_handle_2;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch__reinterpret_tensor(buf3, 2, int_array_0, int_array_1, 0L, &tmp_tensor_handle_2));
    ...
    aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, std::vector<int64_t>{1}.data(), 3, std::vector<AtenTensorHandle>{RAIIAtenTensorHandle(tmp_tensor_handle_2), buf5, buf6}.data());
    buf3.reset();
```

I am not exactly a big fan of such `std::vector{...}.data()` for creating a temp array, but I can't think of another fix.

Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops

Reviewed By: desertfire

Differential Revision: D49758764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110451
Approved by: https://github.com/desertfire
2023-10-04 02:20:31 +00:00
eellison
dd95eaaf1a turn back on constant folding in fbcode (#108604)
Differential Revision: [D49020794](https://our.internmc.facebook.com/intern/diff/D49020794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108604
Approved by: https://github.com/davidberard98, https://github.com/mlazos
2023-10-04 02:13:03 +00:00
Howard Huang
efb73fe8e4 Fix send()/recv() to adhere to timeout (#109611)
Summary: Point to point ops don't enqueue their work to the `workMetaList_` which means that the NCCL watchdog does not watch over them, hence they do not respect the collective timeouts.

Test Plan:
While trying to add a test I found we don't have tests which validate the NCCL watchdog. It looks like this is because we don't have a good way to detect when the NCCL watchdog has thrown an error (the exception is thrown in a side thread) in our testing framework / `MultiprocessTestCase`.

I manually tested this change with the script in https://github.com/pytorch/pytorch/issues/109401, but need to look more closely at how to automate a test for the NCCL watchdog.

Differential Revision: D49418976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109611
Approved by: https://github.com/wconstab
2023-10-03 23:27:45 +00:00
Xiaodong Wang
a0bffe7ed7 [S366352] Print nccl version during initialization (#110305)
Summary: print nccl version during initialization

Differential Revision: D49603220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110305
Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/rohan-varma
2023-10-03 23:09:48 +00:00
cyy
c31fcdaa4f [3/N] Add -Wdeprecated and related fixes (#109698)
This PR follows #108626. Hopefully we can enable the warning in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109698
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2023-10-03 22:50:53 +00:00
Mu-Chu Lee
836ba6430a [AOTInductor] Initial functionality for Inf and NaN checker (#109526)
Summary:
Add initial functionality for Inf and NaN checker for AOTInductor.

Test Plan:
Included in commit. Skipped for CI as SIGABRT can't be captured by pytest.

Differential Revision: [D49379751](https://our.internmc.facebook.com/intern/diff/D49379751)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109526
Approved by: https://github.com/chenyang78
2023-10-03 22:39:42 +00:00
eellison
98c8550158 Fix Triplet Margin Loss Opinfo (#110302)
Triplet Margin Loss takes in a Callable `distance_function` parameter which is not supported as an argument on the fx graph. See previous error:

> File "/scratch/eellison/work/pytorch/torch/_dynamo/symbolic_convert.py", line 562, in call_function
self.push(fn.call_function(self, args, kwargs))
File "/scratch/eellison/work/pytorch/torch/_dynamo/variables/torch.py", line 723, in call_function
*proxy_args_kwargs(args, kwargs),
File "/scratch/eellison/work/pytorch/torch/_dynamo/utils.py", line 504, in proxy_args_kwargs
f"call_function args: {typestr(*args)} {typestr(*list(kwargs.values()))}"
File "/scratch/eellison/work/pytorch/torch/_dynamo/exc.py", line 143, in unimplemented
raise Unsupported(msg)
torch._dynamo.exc.Unsupported: call_function args: TensorVariable() TensorVariable() TensorVariable() ConstantVariable(float) NNModuleVariable()

This is fixable by just inlining into `triplet_margin_loss` and continuing to compile it. This required support for `has_torch_function_variadic`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110302
Approved by: https://github.com/mlazos
2023-10-03 20:26:13 +00:00
Peter Bell
dc794ec32c [dynamo] Trace through builtin abs (#110398)
In Python, `abs(x)` does nothing but delegate to `x.__abs__()`, so we should do
the same in dynamo. This also adds `SymNode.__abs__` so we can trace through
indexing expressions involving `abs`.
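
The delegation can be seen in plain Python. In this small sketch, `SymLike` is a hypothetical class (not `SymNode`) whose `__abs__` records the operation instead of computing a number:

```python
class SymLike:
    """Hypothetical symbolic value that records operations as strings."""

    def __init__(self, expr):
        self.expr = expr

    def __abs__(self):
        # Builtin abs() dispatches here, so the result stays symbolic.
        return SymLike(f"Abs({self.expr})")


result = abs(SymLike("s0 - s1"))
```

Because the builtin only dispatches to `__abs__`, `result.expr` is the string `Abs(s0 - s1)` and no concrete value is ever needed.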

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110398
Approved by: https://github.com/jansel, https://github.com/lezcano
2023-10-03 19:25:37 +00:00
Pruthvi Madugundu
9ce2e02fd6 Revert "[ROCm] Remove PYTORCH_MIOPEN_SUGGEST_NHWC flag (#90725)" (#110319)
This reverts commit 66bfcd32fd.

NHWC has a perf regression on MIOpen, so reverting until the performance issue is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110319
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/kit1980
2023-10-03 19:14:47 +00:00
Brian Hirsh
b457e3f79a Reland attempt 2 of "Update AOTAutograd to use FunctionalTensorMode instead of C++ functionalization (#106406)" (#109906)" (#110079)
The first reland broke internal (failing diff: D49617462).

The major error looks like it's because there's an internal-only higher order op that needs a new functionalization rule. I'm going to land an internal diff for that and confirm tests pass before relanding this PR.

Also confirmed that the issue from https://github.com/pytorch/pytorch/issues/110121 is fixed, and added a test.

This reverts commit 1b90f07f5a.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110079
Approved by: https://github.com/ezyang
2023-10-03 18:50:25 +00:00
Octavian Guzu
b5c3a17c2c [fuzzing result][fuzz_torch_jit_lite_interpreter] read-heap-buffer-overflow-far-from-bounds (size 4) in c10::IValue::IValue() (#110441)
Summary: This diff fixes a heap underflow found by fuzzing in torch/csrc/jit/runtime/vararg_functions.cpp

Test Plan:
CI and
```
arc lionhead crash reproduce 1753074381791061
```
doesn't crash anymore.

Differential Revision: D49537535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110441
Approved by: https://github.com/Skylion007
2023-10-03 18:48:12 +00:00
Yang Chen
da63c7f2c3 [AOTInductor] remove CUDA dependency for cpp backend (#110409)
Summary:
Previously, we linked against CUDA libs even for the pure cpp backend.
This caused issues for cases where the inference platform does not
have GPUs. This diff removes the CUDA dependency for the cpp backend.

Reviewed By: bertmaher, muchulee8, mikekgfb

Differential Revision: D49800712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110409
Approved by: https://github.com/bertmaher, https://github.com/desertfire
2023-10-03 18:36:00 +00:00
PyTorch MergeBot
df3ab70dde Revert "Added new test sample to interpolate op in OpInfo (#104181)"
This reverts commit 87f8bc65f8.

Reverted https://github.com/pytorch/pytorch/pull/104181 on behalf of https://github.com/peterbell10 due to Causing OOM in slow-gradcheck ([comment](https://github.com/pytorch/pytorch/pull/104181#issuecomment-1745472323))
2023-10-03 18:07:02 +00:00
Rohan Varma
40be6b72e1 [ez] Type function in distributed_c10d (#110435)
This function returns a `torch.device`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110435
Approved by: https://github.com/awgu
2023-10-03 17:54:04 +00:00
vfdev
5977d17953 Update common_methods_invocations.py (#110383)
Description:
- Fixed misleading test sample case

Context: sample input is composed of input tensor `(N, C, iH, iW)` and grid tensor `(N, oH, oW, 2)`, however, grid is defined as `(N, C, oW, 2)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110383
Approved by: https://github.com/peterbell10
2023-10-03 17:53:39 +00:00
Bert Maher
aecfe5d168 [aoti] Remove pessimizing move (#110446)
"`std::move` of a temporary prevents copy elision" says the compiler,
and I am pretty sure it is right.  Since AtenTensorHandle* implicitly converts
to RAIIAtenTensorHandle, I simply called emplace_back; happy to put an explicit
ctor if that makes folks happier.

Differential Revision: [D49842542](https://our.internmc.facebook.com/intern/diff/D49842542/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110446
Approved by: https://github.com/desertfire, https://github.com/Skylion007
ghstack dependencies: #110445
2023-10-03 17:44:58 +00:00
Bert Maher
174e46b853 [inductor][easy] Free functions in headers should be declared inline (#110445)
If multiple files include model.h, you end up with duplicate symbols errors.

Differential Revision: [D49842167](https://our.internmc.facebook.com/intern/diff/D49842167/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110445
Approved by: https://github.com/desertfire, https://github.com/Skylion007
2023-10-03 17:44:49 +00:00
Levy Zhao
7f0a659ccc Script to compare measured (trace) runtimes with estimated runtimes (#108037) (#109076)
Summary:

X-link: https://github.com/pytorch/benchmark/pull/1856

Reviewed By: xmfan, xuzhao9

Differential Revision: D48523883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109076
Approved by: https://github.com/xw285cornell
2023-10-03 17:05:35 +00:00
Jerry Zhang
f2a1b93549 Back out "[quant] Support integer implementations for adaptive_avg_pool2d (#104226)" (#110316)
Summary:
Original commit changeset: acdb5b34e3aa

Original Phabricator Diff: D47321689

Test Plan: opinfo tests in CI

Differential Revision: D49789403

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110316
Approved by: https://github.com/kimishpatel
2023-10-03 16:59:23 +00:00
Yanbo Liang
9bc5e10899 [New][1/N] Dynamo skipfiles refactor (#110330)
This is the replacement of #109567. Now I preserved all existing semantics and only focusing on API (for developers) and code structure changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110330
Approved by: https://github.com/ezyang
2023-10-03 16:50:33 +00:00
David Berard
4069d1de59 [distributed] Remove recordStream for callback that ends a profiler event (#109933)
**Background**: recordStreams can result in memory spikes, so we don't want them to appear in FSDP (https://dev-discuss.pytorch.org/t/fsdp-cudacachingallocator-an-outsider-newb-perspective/1486). @ awgu is working on fixing this, but it turns out profiler was causing recordStream to get called when it is enabled.

Why profiler was causing recordStream to get called: NCCL calls add profiler events manually; they register a callback to be executed when the future for the collective is completed; this indicates the end of the CPU-side profiler event for the callback:

c2c7c4035f/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L1822-L1824)

In order to guarantee safety, ivalue::Future::invokeCallback calls `recordStream` on the future's storage buffers; this marks the fact that other streams (e.g. the one that the callback runs on) may need to use the storage.

c2c7c4035f/aten/src/ATen/core/ivalue_inl.h (L1171-L1173)

**Change**: The end-profiler-event callback doesn't actually use the future, so we don't need to recordStream on it. This PR introduces an optional parameter `uses_future` for adding callbacks; a user can set this variable to "false" to unsafely skip the recordStream, if the user knows that the future will not be used in the lambda.

**Tests**: (a) unit tests; (b) added an assert in recordStream: c2c7c4035f/c10/cuda/CUDACachingAllocator.cpp (L3260) and verified that it doesn't get triggered when running basic distributed tests w/ profiler enabled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109933
Approved by: https://github.com/wconstab
2023-10-03 14:40:43 +00:00
Stephen Jia
ff96f6d04f [core IR][reland] Add split.Tensor and unbind decompositions to core ATen decomp table (#110323)
Summary:
This is a reland of [github PR #110102]( https://github.com/pytorch/pytorch/pull/110102).

The original PR had to be unlanded due to internal CI failures. This diff applies some small fixes to the failing tests to adjust to the new decompositions.

Note that `lift_fresh` will not be decomposed for now, since it was found that [constant propagation looks specifically for `lift_fresh`](13af952f94/torch/fx/experimental/proxy_tensor.py (L381-L386)). Therefore decomposing `lift_fresh` will interfere with constant propagation during export.

Test Plan: Github CI and internal CI

Differential Revision: D49761321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110323
Approved by: https://github.com/jansel
2023-10-03 14:35:04 +00:00
Yu, Guangye
2cbfcc740f use torch.xpu.manual_seed_all in torch.seed (#110376)
# Motivation
Use manual_seed_all instead of manual_seed, because multi-device is supported in the xpu backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110376
Approved by: https://github.com/ezyang
2023-10-03 13:41:55 +00:00
HDCharles
428cbd7513 [ao] fixing multihead attention convert size (#110407)
Summary: after converting nn.multihead attention we weren't deleting the
old in_proj_weight and in_proj_bias despite not (really) using them.

Test Plan: python test/test_quantization.py -k
"test_custom_module_multi_head_attention"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110407
Approved by: https://github.com/jerryzh168
2023-10-03 08:49:12 +00:00
Sherlock Huang
15219f53d1 [AOTInductor] Fix ProxyExecutor's handling on multiple outputs (#110374)
Summary: Fix ProxyExecutor after D49780781

Test Plan: buck2 run mode/dev-nosan deeplearning/aot_inductor/test:test_custom_ops

Differential Revision:
D49816044

Privacy Context Container: 368960445142440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110374
Approved by: https://github.com/chenyang78
2023-10-03 06:42:22 +00:00
wz337
d15d7a6485 [DTensorTestbase] Add "cpu:gloo,cuda:nccl" backend to DTensorTestbase (#110397)
This PR makes `backend` a property of DTensorTestbase and adds "cpu:gloo,cuda:nccl" support to DTensorTestbase so that we can use the `cpu:gloo,cuda:nccl` backend for checkpoint unit tests.

cc. @wanchaol, @fduwjj, @XilunWu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110397
Approved by: https://github.com/wanchaol
2023-10-03 04:54:02 +00:00
Fuzzkatt
e55d6f923c minor tf32 fixes for unit tests on H100 and L40 (#110201)
fixes the following tests which were failing in the NVIDIA internal CI on H100 and L40:

test/test_nn.py:
* test_TransformerEncoderLayer_gelu_activation_cuda_tf32
* test_Transformer_multilayer_coder_cuda_tf32

test/inductor/test_torchinductor.py:
* test_batch_norm_2d_2_cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110201
Approved by: https://github.com/mikaylagawarecki, https://github.com/jansel, https://github.com/Skylion007
2023-10-03 00:10:37 +00:00
eellison
3812f2e40c Preserve layout on like constructors (#110242)
Partially fixes `test_memory_format_factory_like_functions_preserve` with PYTORCH_TEST_WITH_INDUCTOR. Inductor preserves memory layouts for user-visible outputs as annotated on the fx graph that it is passed in. That graph is generated from running aot_autograd with decompositions. If the decompositions give incorrect strides, so will inductor.

This preserves the layout of `_like` operators when it corresponds to a `torch.memory_format`. It doesn't fix a) arbitrary permutations, b) striding of non-dense outputs. Both of these are lower-pri compared to preserving channels last. We would need either https://github.com/pytorch/pytorch/issues/92920 or a `to` variant that takes in a physical layout to support arbitrary permutations. I converted the output of rand to the correct layout instead of passing the layout in so that this would compose with the `replace_random` pass, and because the two pointwise ops will get fused anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110242
Approved by: https://github.com/int3
2023-10-02 23:53:55 +00:00
cyy
d58a91b2a6 [4/N] Move remaining c10::variant calls to std::variant (#110382)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110382
Approved by: https://github.com/Skylion007
2023-10-02 23:52:04 +00:00
Peter Bell
01b2f25ebd [inductor] Cast loads from boolean tensors to tl.int1 (#110388)
Triton currently loads pointer to `tl.int1` as `tl.int8`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110388
Approved by: https://github.com/lezcano, https://github.com/Skylion007
2023-10-02 22:52:08 +00:00
PyTorch MergeBot
cba3f407b1 Revert "[HigherOrderOp] Flatten outputs of wrap. (#109433)"
This reverts commit 651b198cdf.

Reverted https://github.com/pytorch/pytorch/pull/109433 on behalf of https://github.com/kit1980 due to Depends on reverted https://github.com/pytorch/pytorch/pull/110290 ([comment](https://github.com/pytorch/pytorch/pull/109433#issuecomment-1743766271))
2023-10-02 21:09:19 +00:00
Chien-Chin Huang
cdde899a73 [FSDP][optim_state_dict] Fuse allgather for optim_state_dict when use_orig_params is True (#108298)
The original implementation of `_gather_orig_param_state` is naive. It performs one allgather_object and two allgathers (if the optimizer is Adam) per FQN. This can be slow and make `_optim_state_dict` a bottleneck.

This PR rewrites the implementation and fuses all the `allgather_object`s into one. As for `allgather`, it is fused based on the information of FlatParameters. So there will be 2N `allgather` calls, where N is the number of FlatParameters and the factor of 2 is due to Adam having 2 states per FQN.

One experiment on 8 A100 GPUs shows that the gathering time improves from 3 seconds to 0.3 seconds.

Differential Revision: [D48835138](https://our.internmc.facebook.com/intern/diff/D48835138/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/108298
Approved by: https://github.com/awgu
2023-10-02 20:57:08 +00:00
Jez Ng
15dfe7b8e3 Actually enable typechecking for _inductor/index_propagation.py (#110110)
It was supposed to be enabled in #105622 but that PR neglected to update
.lintrunner.toml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110110
Approved by: https://github.com/Skylion007
2023-10-02 20:57:03 +00:00
sunghyunjun
b5268456f9 Fix optimize_for_inference to support modules that don't have a forward method (#110013)
Fixes #108662

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110013
Approved by: https://github.com/davidberard98
2023-10-02 20:13:44 +00:00
Yukio Siraichi
651b198cdf [HigherOrderOp] Flatten outputs of wrap. (#109433)
Fix: #109247

This PR flattens `wrap` outputs by inlining `pytree.tree_flatten` function after calling
the inner function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109433
Approved by: https://github.com/zou3519
ghstack dependencies: #110290
2023-10-02 19:58:30 +00:00
RihamSelim
92242f599a [PyTorch] Add Expanded call stack to nodes [Take 2] (#110229)
Summary:
Adding back D46578700 / PR https://github.com/pytorch/pytorch/pull/108426

Note: The changes were originally reverted due to a memory regression. These changes put the code behind a gflag so that it is only used by binaries that require the expanded stack for BPF profiling.

Original Diff comment:
To get a Node's call stack we currently loop on the InlinedCallStack graph and follow the "callee" chain. Since the node's inlined stack does not change, we can optimize this by expanding the node's inlined stack once and reusing it. This is particularly useful when reading the node's stack from another process (e.g. BPF), as it simplifies the memory traversal process.
The new data structure (NodeSourceInfo) only holds pointers to the function name and file name variables, and assumes these objects will be alive throughout the lifetime of the process.
Each Node has an extended attribute that holds an index into a vector of stack frames, expanded_node_stacks_.
node_stack_attr_symbol_ is only needed to make accessing the stack vector index attribute easier from BPF.
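
The idea of expanding the stack once and storing only an index on the node can be sketched in Python. This is an illustration of the data layout, not the C++ implementation: `InlinedFrame`, `Graph`, and `add_node_stack` are hypothetical names, and only `expanded_node_stacks` mirrors a name from the description.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InlinedFrame:
    """Hypothetical stand-in for one entry in an inlined call stack."""
    function: str
    filename: str
    callee: Optional["InlinedFrame"] = None


def walk_callee_chain(frame: Optional[InlinedFrame]) -> List[Tuple[str, str]]:
    """Follow the callee chain frame by frame (the per-query approach)."""
    frames = []
    while frame is not None:
        frames.append((frame.function, frame.filename))
        frame = frame.callee
    return frames


class Graph:
    """Stores each expanded stack once; a node keeps only an index into it."""

    def __init__(self) -> None:
        self.expanded_node_stacks: List[List[Tuple[str, str]]] = []

    def add_node_stack(self, frame: Optional[InlinedFrame]) -> int:
        # Expand once, at node creation time, and hand back the index.
        self.expanded_node_stacks.append(walk_callee_chain(frame))
        return len(self.expanded_node_stacks) - 1


inner = InlinedFrame("forward", "model.py")
outer = InlinedFrame("run", "main.py", callee=inner)
graph = Graph()
stack_index = graph.add_node_stack(outer)
print(graph.expanded_node_stacks[stack_index])
```

A reader in another process then needs only the vector and the index, rather than chasing a chain of pointers.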

Test Plan:
- Verified using BPF Program in subsequent diffs
- Perf testing for loading large model: P822455246

Differential Revision: D49565461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110229
Approved by: https://github.com/zdevito
2023-10-02 19:52:41 +00:00
Ubuntu
16e3f158b9 Add function to port FX minified graph to HLO via StableHLO (#109084)
If the `XLA_HLO_DEBUG` flag is enabled, generate a minified HLO graph when using the minifier. This function enables HLO minification support by porting the minified FX graph to StableHLO via the `save_torch_model_as_stablehlo` function.

This allows users to port the minified graph to compilers that are not compatible with the TorchDynamo/Inductor workflow and use XLA instead. The purpose of this PR is to help XLA users debug accuracy and compilation errors. It will also be helpful for the existing TorchDynamo/XLA workflow on the `torchxla_trace_once` backend.

Fixes [#5461](https://github.com/pytorch/xla/issues/5461) in Torch XLA repo. CC @GleasonK @qihqi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109084
Approved by: https://github.com/anijain2305
2023-10-02 19:36:04 +00:00
PyTorch MergeBot
7e6cf04a84 Revert "Multiprocessing support for NT (#110292)"
This reverts commit 881e7304d6.

Reverted https://github.com/pytorch/pytorch/pull/110292 on behalf of https://github.com/jbschlosser due to Address review comments ([comment](https://github.com/pytorch/pytorch/pull/110292#issuecomment-1743524901))
2023-10-02 18:27:13 +00:00
Joel Schlosser
881e7304d6 Multiprocessing support for NT (#110292)
Fixes #110161

Allows NTs to be used in DataLoaders with `num_workers > 1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110292
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #110219
2023-10-02 18:14:34 +00:00
Wanchao Liang
26900d21c2 [dtensor] skip pytree when not necessary (#110132)
pytree is a great tool, but it is sometimes considered evil for
tensor subclasses. It's useful for implementing a subclass quickly, but:
* it exposes non-trivial CPU overhead
* many ops don't need pytree; only the ones involving lists/dicts need it
* blindly using pytree to re-wrap has semantic issues for inplace/out
ops

This PR avoids using pytree for most ops during torch_dispatch and only
enables it for certain ops
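
A minimal sketch of the fast-path idea, in plain Python rather than with the real pytree utilities: recurse into arguments only when a container is actually present. `needs_tree_traversal` and `map_args` are hypothetical helpers written for this illustration, not functions from the PR.

```python
from typing import Any, Callable, Tuple


def needs_tree_traversal(args: Tuple[Any, ...]) -> bool:
    """True only when some argument is a container that may hide nested values."""
    return any(isinstance(a, (list, tuple, dict)) for a in args)


def map_args(fn: Callable[[Any], Any], args: Tuple[Any, ...]) -> Tuple[Any, ...]:
    """Apply fn to every leaf, recursing only when a container is present."""
    if not needs_tree_traversal(args):
        # Fast path: flat positional arguments, no recursive traversal.
        return tuple(fn(a) for a in args)

    def recurse(value: Any) -> Any:
        if isinstance(value, (list, tuple)):
            return type(value)(recurse(v) for v in value)
        if isinstance(value, dict):
            return {k: recurse(v) for k, v in value.items()}
        return fn(value)

    return tuple(recurse(a) for a in args)


print(map_args(lambda x: x * 2, (1, 2)))
print(map_args(lambda x: x * 2, (1, [2, 3], {"k": 4})))
```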
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110132
Approved by: https://github.com/fduwjj
2023-10-02 17:44:34 +00:00
Jon Chuang
46d1f9b385 fix(lint): Fix lint issues on main (#110389)
Lint issue was introduced in https://github.com/pytorch/pytorch/pull/110186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110389
Approved by: https://github.com/Skylion007, https://github.com/malfet
2023-10-02 17:04:01 +00:00
Li-Huai (Allan) Lin
a3c1e3c95c Generalize toAccumulateType() (#108248)
Trying to address this comment: https://github.com/pytorch/pytorch/pull/106666#discussion_r1297397554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108248
Approved by: https://github.com/kulinseth, https://github.com/albanD
2023-10-02 16:34:36 +00:00
Angela Yi
e47e946bbf [aotinductor] Use dynamic_shape instead of constraints (#110360)
Summary:
Previously we used export's constraints to specify that all batch-size dimensions are dynamic. This is done by creating 1 constraint `dynamic_dim(inp[0][0], lower, upper)`, followed by `dynamic_dim(inp[0][0]) == dynamic_dim(inp[i][0])` for every input `i`.

Through the new `dynamic_shapes` API, we can use `Dims("batch_size")` on every dimension to specify which dimensions are dynamic and equal to each other, and `None` otherwise: `{i: [Dims("batch_size", lower, upper), None] for every input i}`

Note: `dynamic_shapes` and `constraints` utilize the same "constraints" backend so this diff should be idempotent.

Test Plan: `buck2 run @//mode/dev-nosan //caffe2/torch/fb/model_transform/experimental/benchmark/test/aotinductor:test_aot_inductor_benchmark`

Reviewed By: chenyang78, aakhundov

Differential Revision: D49784351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110360
Approved by: https://github.com/desertfire
2023-10-02 16:09:37 +00:00
vfdev-5
87f8bc65f8 Added new test sample to interpolate op in OpInfo (#104181)
Description:
- Added new test sample to interpolate op in OpInfo
- Fixed silent issue with zero tensor test sample for uint8 dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/104181
Approved by: https://github.com/pmeier, https://github.com/lezcano
2023-10-02 15:35:48 +00:00
cdzhan
175b626216 Enable torch.promote_types in Dynamo tracing (#110358)
Fixes #109508

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110358
Approved by: https://github.com/Skylion007
2023-10-02 15:20:36 +00:00
Alexander Grund
e0348ceceb Avoid undefined behavior in JIT-generated conversion code (#110212)
The inductor/dynamo JIT generator creates C++ code using `static_cast` for type conversions.
This can be undefined behavior, e.g. for `static_cast<uint8_t>(floatVal)` where `floatVal` is a negative value.

To avoid this in the "regular" C++ code `c10::convert` is used. So use it in the JIT generated code too.
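
One way to see why an intermediate step makes the conversion well defined is to model it in Python. The sketch below assumes the safe path converts the float to a wide signed integer first and then narrows to 8 bits; that two-step route is an assumption made for illustration, not something this commit message states, and `float_to_uint8_via_int64` is a hypothetical name.

```python
import math


def float_to_uint8_via_int64(value: float) -> int:
    """Model a float -> int64 -> uint8 conversion with defined results."""
    if not math.isfinite(value):
        raise ValueError("NaN and infinities are outside this sketch")
    # Step 1: float -> signed integer truncates toward zero.
    as_int = int(value)
    if not -(2 ** 63) <= as_int < 2 ** 63:
        raise OverflowError("value does not fit in a signed 64-bit integer")
    # Step 2: signed -> unsigned narrowing is modular, so it is always defined.
    return as_int % 256


print(float_to_uint8_via_int64(-1.0))   # 255
print(float_to_uint8_via_int64(300.7))  # 44
```

A direct float to `uint8_t` cast has no such guarantee when the truncated value falls outside 0..255, which is the case the commit is guarding against.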

Fixes #110077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110212
Approved by: https://github.com/ezyang, https://github.com/jgong5, https://github.com/desertfire
2023-10-02 12:56:41 +00:00
Menglu Yu
f7812cdbd9 [inductor][Optimus]Improve logging for Optimus (#110186)
Summary: This is based on the diff D49340843. We add more logs to make debugging easier.

Test Plan:
```
[2023-09-27 20:35:53,844] [0/0] torch._inductor.fx_passes.group_batch_fusion: [INFO] Before group_batch fusion in pre grads pass. Print graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GEoA8xb22jibUNEEAPYecF9_RVM1br0LAAAz
[2023-09-27 20:35:55,001] [0/0] torch._inductor.fx_passes.group_batch_fusion: [INFO] Apply fusion BatchLinearFusion. Print graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GPMR9BYffjwToEQCAFS7rgixMi0pbr0LAAAz
[2023-09-27 20:35:57,419] [0/0] torch._inductor.fx_passes.group_batch_fusion: [INFO] Apply fusion BatchLinearLHSFusion. Print graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GKiA8hNycGpBdAIDAOn0c1Hpef4sbr0LAAAz
[2023-09-27 20:35:57,585] [0/0] torch._inductor.fx_passes.group_batch_fusion: [INFO] BatchLayernormFusion: key = ('batch_layernorm', 'torch.Size([2048, 128])', 'torch.Size([128])', 'torch.Size([128])', '(128,)', '1e-05'); subset size = 7
[2023-09-27 20:35:58,493] [0/0] torch._inductor.fx_passes.group_batch_fusion: [INFO] Apply fusion BatchLayernormFusion. Print graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GKpftRa9Glxm-MYDAOZb_D80JHsYbr0LAAAz
[2023-09-27 20:35:59,754] [0/0] torch._inductor.fx_passes.group_batch_fusion: [INFO] Apply fusion BatchTanhFusion. Print graph: https://www.internalfb.com/intern/everpaste/?color=0&handle=GPgh9BZQl4EKGckAAES094iV3Atrbr0LAAAz
I0927 20:36:00.532000 3750607 pre_grad.py:71] After group_batch_fusion_pre_grad_passes: https://www.internalfb.com/intern/everpaste/?color=0&handle=GBPb8xYxfrbXuCMDAI5d_a4YyhFBbr0LAAAz
```

Differential Revision: D49710166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110186
Approved by: https://github.com/jackiexu1992, https://github.com/yanboliang
2023-10-02 07:29:25 +00:00
wz337
a588648759 [DCP] Fix 'torch.cpu' has no attribute 'current_device' in checkpoint/optimizer.py (#110299)
When running on the "gloo" and "cpu:gloo,cuda:nccl" backends, it runs into the following error.

```
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/data/users/irisz/pytorch/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/examples/fsdp_checkpoint_example.py", line 105, in run_fsdp_checkpoint_example
    optim_state = load_sharded_optimizer_state_dict(
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 295, in load_sharded_optimizer_state_dict
    _alloc_tensor(value.properties, value.size, dp_pg_device_type), sharding_spec
  File "/data/users/irisz/pytorch/torch/distributed/checkpoint/optimizer.py", line 109, in _alloc_tensor
    device=cast(torch.device, _get_device_module(device_type).current_device()),
AttributeError: module 'torch.cpu' has no attribute 'current_device'
```

This PR fixes the error in optimizer.py. A follow-up will add "cpu:gloo,cuda:nccl" support in DTensorBase so we can update the unit test to include this backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110299
Approved by: https://github.com/kumpera
2023-10-01 21:54:13 +00:00
Angela Yi
13af952f94 [export] Add run_decomposition() function to ExportedProgram (#110236)
Summary:
https://docs.google.com/document/d/1QJJEGnj2nHGPODlw38BEG3KLLCOTfdOVjPrNQbz_LM8/edit#bookmark=id.lp80wfshq130

`exported_program.run_decompositions(decomposition_table)` will optionally take a decomposition table, and run decompositions on the exported program, returning a new exported program. By default we will run the Core ATen decomposition table.

Splitting up this diff with the following one (D49742989) to make migrating Executorch easier:
1. Land this diff
1. Wait for a pytorch nightly to include this diff
1. Update executorch's pytorch nightly
1. Land the following diff to have export() return no decomps

Test Plan: Tested in following diff

Differential Revision: D49743208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110236
Approved by: https://github.com/gmagogsfm
2023-10-01 18:18:27 +00:00
chilli
13681382d5 Add heuristic for when evict_first should be set (and some other minor things) (#108841)
Example of when the `evict_first` heuristic helps.
```
@torch.compile
def f(a, b):
    return (a * b).sum(dim=-1)

N = 512
inps = (torch.randn(N, N, N).permute(2, 1, 0), torch.randn(N, N, N).permute(1, 2, 0))
from torch._inductor.utils import do_bench
print(do_bench(lambda: f(*inps)))
```

This generates code like this: http://ix.io/4HFs

```
Original: 3.8 ms
This PR: 3.54 ms
Always evict_first: 5.4 ms
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108841
Approved by: https://github.com/lezcano, https://github.com/jansel
2023-10-01 17:06:12 +00:00
ruiren
e4414716d5 [onnx] support attn_mask fp16 type (#110306)
When users define a customized attention mask using `dtype=torch.float16`, e.g.

```
import torch
from torch.nn import functional as F

float_min = torch.finfo(torch.float16).min

attention_mask_fp16 = (attention_mask * 1.0).masked_fill(attention_mask, float_min).to(torch.float16)

attn_output = F.scaled_dot_product_attention(
                 query_layer_, key_layer_, value_layer_, attention_mask_fp16, 0.0, is_causal=False
 )
```

the ONNX graph cannot be exported.

When q, k, v have the fp16 type, we can support an `attn_mask` of `fp16` type by changing the condition to
```
elif _type_utils.JitScalarType.from_value(attn_mask) in (
    _type_utils.JitScalarType.FLOAT,
    _type_utils.JitScalarType.HALF,
):
```
With this change the `.onnx` graph can be exported.

Fixes #109336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110306
Approved by: https://github.com/titaiwangms
2023-10-01 14:50:58 +00:00
Sherlock Huang
898656e9d1 [AOTInductor] ProxyExecutor supports Tuple of Tensor and List[Tensor] in returns (#110187)
Summary:
ProxyExecutor supports custom ops that return a tuple mixed of Tensor and List[Tensor]
e.g. `"fn_with_mix_outputs(Tensor t, Tensor[] tensors) -> (Tensor, Tensor[])"`

Example:
`out7, [out8, out9] = torch.ops.fb.fn_with_mix_outputs(out5, [out6, out4])`
got compiled into
```
    AtenTensorHandle buf11_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf11_handle));
    RAIIAtenTensorHandle buf11(buf11_handle);
    AtenTensorHandle buf12_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf12_handle));
    RAIIAtenTensorHandle buf12(buf12_handle);
    AtenTensorHandle buf13_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf13_handle));
    RAIIAtenTensorHandle buf13(buf13_handle);
    AtenTensorHandle tensor_args_var_7[] = {buf8.get(), buf9.get(), buf6.get(), buf11.get(), buf12.get(), buf13.get()};
    int64_t int_args_var_8[] = {};
    aoti_torch_proxy_executor_call_function(proxy_executor, 3, 0, int_args_var_8, 6, tensor_args_var_7);
```
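
The generated call above passes the inputs and the output buffers as one flat tensor array, with integer arguments in a separate array. A rough Python sketch of that flattening follows; the ordering (inputs first, then outputs) is inferred from this one example, and `flatten_call_args` is a hypothetical helper, not part of ProxyExecutor.

```python
from typing import Any, List, Sequence, Tuple


def flatten_call_args(
    inputs: Sequence[Any], outputs: Sequence[Any]
) -> Tuple[List[int], List[str]]:
    """Split arguments into an int list and a tensor list, inputs before outputs."""
    int_args: List[int] = []
    tensor_args: List[str] = []

    def visit(value: Any) -> None:
        if isinstance(value, (list, tuple)):
            for item in value:
                visit(item)
        elif isinstance(value, int):
            int_args.append(value)
        else:
            # Buffer names stand in for tensor handles in this sketch.
            tensor_args.append(value)

    for value in list(inputs) + list(outputs):
        visit(value)
    return int_args, tensor_args


ints, tensors = flatten_call_args(
    inputs=["buf8", ["buf9", "buf6"]],
    outputs=["buf11", ["buf12", "buf13"]],
)
print(len(ints), len(tensors))  # 0 6
print(tensors)
```

The counts match the `0` and `6` passed to `aoti_torch_proxy_executor_call_function` in the generated code.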

Serialized extern node
```
    {
      "name": "buf10",
      "node": {
        "target": "fb::fn_with_mix_outputs",
        "inputs": [
          {
            "name": "t",
            "arg": {
              "asTensor": {
                "name": "buf8"
              }
            }
          },
          {
            "name": "tensors",
            "arg": {
              "asTensors": [
                {
                  "name": "buf9"
                },
                {
                  "name": "buf6"
                }
              ]
            }
          }
        ],
        "outputs": [
          {
            "asTensor": {
              "name": "buf11"
            }
          },
          {
            "asTensors": [
              {
                "name": "buf12"
              },
              {
                "name": "buf13"
              }
            ]
          }
        ],
        "metadata": {}
      }
    }
```

Test Plan: Test

Differential Revision: D49710320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110187
Approved by: https://github.com/chenyang78
2023-09-30 19:47:01 +00:00
Colin Peppler
6bb448a2d3 [inductor][fbcode] Add -D C10_DISABLE_TENSORIMPL_EXTENSIBILITY to cpp_compile_command (#110122)
Summary:
## Why?

The .so and .h files are compiled separately with different flags. The .so is compiled by AOTInductor and .h files (e.g. c10/core/TensorImpl.h) are compiled by buck2.

Let's make sure the .so is also compiled with this macro in fbcode.

Differential Revision: D49664078

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110122
Approved by: https://github.com/chenyang78, https://github.com/khabinov
2023-09-30 16:34:59 +00:00
cyy
d0ad848aa5 Enable misc clang-tidy checks (#110283)
This PR enables the misc-XX checks in clang-tidy. Meanwhile, I excluded some of them that require a lot of code changes and have no immediate benefits. Some additional fixes and suppression were also given.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110283
Approved by: https://github.com/albanD
2023-09-30 10:39:52 +00:00
Adnan Akhundov
2ead6c2f6e Skip launching kernels with zero grid in AOT Inductor (#110312)
Summary: With the grid computed in terms of unbacked `SymInt`s, the grid can turn out to be zero-size. This causes a CUDA error on `cuLaunchKernel` in the AOT Inductor codegen.

In this PR, when the grid contains unbacked `SymInt`s, a check is added around the `launchKernel` in the AOT Inductor's C++ wrapper codegen to make sure that the grid is not zero-size.
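
The guard amounts to checking the grid before launching. A minimal Python sketch of that check is below; the real change is in generated C++ wrapper code, and `launch_if_nonempty` is a hypothetical name used only here.

```python
from typing import Callable, List, Sequence


def launch_if_nonempty(
    grid: Sequence[int], launch: Callable[[Sequence[int]], None]
) -> bool:
    """Call launch only when no grid dimension is zero; report whether it ran."""
    if any(dim == 0 for dim in grid):
        return False
    launch(grid)
    return True


launched: List[Sequence[int]] = []
print(launch_if_nonempty((4, 0, 1), launched.append))  # False
print(launch_if_nonempty((4, 2, 1), launched.append))  # True
```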

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110312
Approved by: https://github.com/chenyang78
2023-09-30 09:12:56 +00:00
Oguz Ulgen
f7ba3e85e2 [Dynamo] Add functional triton kernel wrapper (#110185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110185
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/bdhirsh
ghstack dependencies: #109623
2023-09-30 04:20:20 +00:00
Nikita Shulga
ad8aef0f98 [BE] [3/N] Use nested namespaces (#110314)
Mostly in torch/csrc/jit/runtime and in `ATen/cuda/`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110314
Approved by: https://github.com/seemethere
2023-09-30 02:23:48 +00:00
leslie-fang-intel
7eeb392eb3 [Inductor] Enable the item() and nonzero() codegen test on CPU (#110262)
**Summary**
Follow-up to https://github.com/pytorch/pytorch/pull/109893, which has an issue with CPU support as reported in https://github.com/pytorch/pytorch/issues/109897. This fix mainly includes 2 changes:

-  The current implementation of `rename_indexing`
10c646295d/torch/_inductor/codegen/common.py (L1023) only adds symbols whose names start with `s` or `ps` to `kernel.args.sizevars`. However, an unbacked symint's name starts with `i`, so we extend the implementation of `rename_indexing` to support symbols starting with `i`.
- Currently, the internal loop index name also starts with `i`. Since `i` is already used for unbacked symints, change the name to start with `x`, which aligns with Triton.
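
The prefix rule can be illustrated with a small filter. This is only a sketch of the naming convention described above; `SIZE_VAR_PREFIXES` and `collect_size_vars` are hypothetical names, not the actual `rename_indexing` code.

```python
from typing import Iterable, List

# "s" and "ps" were already recognised; "i" is added for unbacked symints.
SIZE_VAR_PREFIXES = ("s", "ps", "i")


def collect_size_vars(symbol_names: Iterable[str]) -> List[str]:
    """Keep the symbols that should be registered as size variables."""
    return [name for name in symbol_names if name.startswith(SIZE_VAR_PREFIXES)]


# "x3" plays the role of a loop index, which now uses the "x" prefix.
print(collect_size_vars(["s0", "ps1", "i2", "x3"]))  # ['s0', 'ps1', 'i2']
```

Renaming loop indices to `x` is what keeps them from being picked up by the new `i` rule.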

**Test Plan**
```
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_bool_mask_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_nonzero_size_factory_nobreak
python -u -m pytest -s -v test_torchinductor_dynamic_shapes.py -k test_item_zeros_nobreak
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110262
Approved by: https://github.com/ezyang, https://github.com/jgong5
2023-09-30 00:13:20 +00:00
ancestor-mithril
e0be9ebc18 Simplify the conditionals used for learning rate calculation for ConstantLR learning rate scheduler (#109785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109785
Approved by: https://github.com/janeyx99, https://github.com/kit1980
2023-09-29 23:11:23 +00:00
Bin Bao
993eea0edd [aotinductor] Fix a missing schema issue for repeat_interleave (#110105)
Differential Revision: [D49686812](https://our.internmc.facebook.com/intern/diff/D49686812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110105
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/aakhundov
2023-09-29 23:01:37 +00:00
davidgens-cerebras
ee0bff209c [LTC] correct AdaptiveAvgPool3d channel dim index for shape inference (#109822)
Fixes #109821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109822
Approved by: https://github.com/mikaylagawarecki, https://github.com/alanwaketan
2023-09-29 22:54:12 +00:00
PyTorch MergeBot
b083058e45 Revert "Make unbind() overrideable for NT subclass (#109122)"
This reverts commit f5a23ca78d.

Reverted https://github.com/pytorch/pytorch/pull/109122 on behalf of https://github.com/PaliC due to breaking slow tests ([comment](https://github.com/pytorch/pytorch/pull/109122#issuecomment-1741555305))
2023-09-29 22:41:56 +00:00
Octavian Guzu
9c7071b0e3 [fuzzing result][fuzz_torch_jit_lite_interpreter] read-heap-use-after-free (size 8) in std::_Function_base::_M_empty() (#110289)
Summary: This diff fixes a heap UAF found by fuzzing in torch/csrc/jit/mobile/interpreter.cpp

Test Plan:
CI and
```
arc lionhead crash reproduce 1009060456885023
```
doesn't crash anymore.

Reviewed By: malfet

Differential Revision: D49538326

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110289
Approved by: https://github.com/malfet
2023-09-29 22:32:38 +00:00
Avik Chaudhuri
359c2a53f5 dynamic_shapes + retrace exported program (#110276)
An `ExportedProgram`'s `__call__` signature is different from the original module, so `dynamic_shapes` that follow the original signature would fail when applied to re-export an `ExportedProgram`.

This PR fixes this issue, in other words, the original `dynamic_shapes` should now work when re-exporting.

Differential Revision: D49764011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110276
Approved by: https://github.com/tugsbayasgalan
2023-09-29 21:06:46 +00:00
PyTorch MergeBot
c2c7c4035f Revert "Simplify the conditionals used for learning rate calculation for ConstantLR learning rate scheduler (#109785)"
This reverts commit 83283b4f0d.

Reverted https://github.com/pytorch/pytorch/pull/109785 on behalf of https://github.com/PaliC due to causing macos errors as per 83283b4f0d ([comment](https://github.com/pytorch/pytorch/pull/109785#issuecomment-1741471142))
2023-09-29 20:49:28 +00:00
atalman
b253fc9c93 Revert "[1/N] Dynamo skipfiles refactor (#109567)" (#110296)
This reverts commit 84c5435b29.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110296
Approved by: https://github.com/yanboliang
2023-09-29 20:35:46 +00:00
Peter Bell
bc047ec906 [inductor] Make sure unfuse_addmm and addmm patterns don't overlap (#110235)
Inductor has two opposing patterns,
```
addmm -> add + mm
add + mm -> addmm
```

This uses the `extra_check` to disable the addmm fusion pattern when the
heuristic to unfuse add is met, for consistency.
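
A sketch of how an extra check keeps two opposing rewrites from both firing on the same input. The heuristic here (small output size) is invented purely for illustration, since the commit message does not say what the real unfuse heuristic is; both function names are hypothetical.

```python
from typing import Tuple

Shape = Tuple[int, int]


def should_unfuse_addmm(out_shape: Shape) -> bool:
    """Hypothetical heuristic: split addmm into add + mm for small outputs."""
    rows, cols = out_shape
    return rows * cols <= 1024


def may_fuse_addmm(out_shape: Shape) -> bool:
    """Extra check for `add + mm -> addmm`: refuse whenever unfusing would apply."""
    return not should_unfuse_addmm(out_shape)


for shape in [(8, 8), (512, 512)]:
    # Exactly one direction is allowed, so the two patterns cannot undo each other.
    print(shape, should_unfuse_addmm(shape), may_fuse_addmm(shape))
```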

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110235
Approved by: https://github.com/lezcano, https://github.com/eellison
ghstack dependencies: #110232
2023-09-29 19:35:29 +00:00
Peter Bell
d04b35e7e3 [inductor] Fix bug in input mutation (#107614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107614
Approved by: https://github.com/jansel
2023-09-29 18:27:06 +00:00
Sherlock Huang
d7de26804e [AOTInductor] ProxyExecutor supports List[Tensor] return type (#110182)
Summary:
Support custom ops that return the List[Tensor] type, like `"fn_with_list_output(Tensor[] tensors, int i) -> Tensor[]"`

As an example
`out5, out6 = torch.ops.fb.fn_with_list_output([out3, out4], 1)`

got compiled into

```
    AtenTensorHandle buf8_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf8_handle));
    RAIIAtenTensorHandle buf8(buf8_handle);
    AtenTensorHandle buf9_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf9_handle));
    RAIIAtenTensorHandle buf9(buf9_handle);
    AtenTensorHandle tensor_args_var_5[] = {buf5.get(), buf6.get(), buf8.get(), buf9.get()};
    int64_t int_args_var_6[] = {1};
    aoti_torch_proxy_executor_call_function(proxy_executor, 2, 1, int_args_var_6, 4, tensor_args_var_5);
```

Test Plan: Test

Differential Revision: D49694691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110182
Approved by: https://github.com/chenyang78
2023-09-29 18:21:48 +00:00
Mu-Chu Lee
d6d3f6cfe5 Add weight update for DSOModel. (#110273)
Summary: Add weight update for DSOModel and AOTInductorModel

Test Plan: buck2 test accelerators/workloads/models/slimdsnn:slimdsnn_dso_test - SlimDSNN.DSO_Update_Constants

Reviewed By: mikekgfb

Differential Revision: D49748685

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110273
Approved by: https://github.com/hl475
2023-09-29 18:14:01 +00:00
Yang Chen
30759848fa [inductor] handle non-list/tuple outputs for FallbackKernel (#110145)
generate_output may return non-list/tuple outputs. Let's force
those to be list, because we will enumerate kernel.outputs
later in the codegen.
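
The normalization described above is small enough to sketch directly. `as_output_list` is a hypothetical helper name; the actual change lives in Inductor's FallbackKernel.

```python
from typing import Any, List


def as_output_list(outputs: Any) -> List[Any]:
    """Wrap a single output in a list so callers can always enumerate it."""
    if isinstance(outputs, (list, tuple)):
        return list(outputs)
    return [outputs]


for index, out in enumerate(as_output_list("buf0")):
    print(index, out)
```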

Also fixed a minor issue in an assertion message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110145
Approved by: https://github.com/aakhundov
2023-09-29 17:13:26 +00:00
Bin Bao
0ff1155d3a [aotinductor] Refactor test_aot_inductor to take different devices (#110216)
Summary: Replace the hardcoded device with self.device, to make it easier to test both cpu and cuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110216
Approved by: https://github.com/chenyang78, https://github.com/bertmaher
ghstack dependencies: #110215
2023-09-29 16:30:19 +00:00
Andrei Gheorghe
28f52f2f80 Fix aminmax on CUDA when input shape contains 0 (#107564)
The CUDA kernel asserts numel() > 0; the CPU kernel doesn't and returns empty values (as expected)

Fixes #95349 and #85439

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107564
Approved by: https://github.com/lezcano
2023-09-29 16:18:08 +00:00
Oguz Ulgen
2d50a30d77 [Dynamo] Add native support for Triton Kernels to Dynamo (#109623)
This PR adds native support to Dynamo to detect Triton kernels and
create an FX graph node out of them. AOT eager and inductor modes will
be supported in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109623
Approved by: https://github.com/jansel
2023-09-29 15:49:18 +00:00
Joel Schlosser
3693777a86 Pickle support for NT (#110219)
Fixes #104198
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110219
Approved by: https://github.com/cpuhrsch
2023-09-29 15:30:06 +00:00
Bert Maher
92f4a7b663 [inductor] Add fbcode include path for cuda (#110240)
We missed the cuda include, leading to failures in cases where CUDA
was not installed locally but only provided via third-party/GVFS.

Differential Revision: [D49745585](https://our.internmc.facebook.com/intern/diff/D49745585/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110240
Approved by: https://github.com/hl475
2023-09-29 13:39:40 +00:00
Peter Bell
758735b739 [dynamo] Convert dtype arguments as well as inputs in cast_to_fp64 (#110232)
Generating reference outputs sometimes fails because of type mismatches in the graph,
an issue which was noticed previously for `prims.convert_element_type` and fixed in #92036,
but the same issue happens with other functions such as tensor constructors.

This expands the fix from #92036 to all dtype keyword arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110232
Approved by: https://github.com/ezyang
2023-09-29 12:42:14 +00:00
Rohan Varma
24e5d61af8 Log usage of optimizer in backward (#110206)
This will allow us to inspect and aggregate jobs that use optimizer in
backward

Differential Revision: [D48674740](https://our.internmc.facebook.com/intern/diff/D48674740/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110206
Approved by: https://github.com/awgu
2023-09-29 11:00:07 +00:00
ancestor-mithril
d615f0078c Updating documentation for PolynomialLR (#110151)
Docstring mentions the power parameter is `int`, when it should have been `float`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110151
Approved by: https://github.com/janeyx99
2023-09-29 03:50:11 +00:00
jjsjann123
e6b5e0ecc6 removing the functionality of nvfuser python APIs (#110124)
Removing the functionalities from nvfuser python APIs.

Since the use of nvfuser was deprecated before the last release cut, we are removing TorchScript support.

A follow-up PR will actually remove the code base.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110124
Approved by: https://github.com/davidberard98
2023-09-29 01:45:00 +00:00
rzou
88de391692 [torch.library] Fix some docstrings (#110214)
Removed some erroneous colons

Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110214
Approved by: https://github.com/ezyang
2023-09-29 01:44:49 +00:00
ancestor-mithril
83283b4f0d Simplify the conditionals used for learning rate calculation for ConstantLR learning rate scheduler (#109785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109785
Approved by: https://github.com/janeyx99, https://github.com/kit1980
2023-09-29 01:19:05 +00:00
Jerry Zhang
c9b8e06060 [quant] Enable quantization for wav2letter (#109830)
Summary:
Also added annotation support for conv1d_relu and conv1d in XNNPACKQuantizer. The quantized results still
match the fx quant path (didn't quantize conv1d), so tests are not disabled

Test Plan: with-proxy buck2 run executorch/examples/quantization:example -- -m=w2l --verify

Differential Revision: D49479546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109830
Approved by: https://github.com/kimishpatel
2023-09-29 00:47:34 +00:00
Animesh Jain
ce8b4f56d8 [dynamo] Dont put nn module guards on torch inbuilt nn modules (#110230)
This is one way to fix https://github.com/pytorch/pytorch/issues/110048

Looking for feedback.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110230
Approved by: https://github.com/ezyang
2023-09-29 00:43:16 +00:00
chunyuan
20dabea35d Inductor cpp wrapper: support MkldnnRnnLayer (#107858)
1. Directly use the `codegen` function of the parent class, which already supports both the python and cpp wrappers.
2. The output of the `at::mkldnn_rnn_layer` OP is actually a `std::tuple` 1491bae277/aten/src/ATen/native/mkldnn/RNN.cpp (L218) Fix the type when calling `MultiOutput`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/107858
Approved by: https://github.com/jgong5, https://github.com/jansel
2023-09-29 00:22:42 +00:00
Edward Z. Yang
d1a13129bb Add support for item() and nonzero() codegen in Inductor (#109893)
This is another version of
https://github.com/pytorch/pytorch/pull/109262 that I think is more
harmonious with inductor design.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109893
Approved by: https://github.com/jansel
2023-09-28 23:37:31 +00:00
Jerry Zhang
3de42995e4 [quant][pt2e] Add quant API re-entrant test (#110125)
Summary:
Add the test to make sure we can call the quantize API multiple times

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_reentrant

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110125
Approved by: https://github.com/kimishpatel
ghstack dependencies: #110097
2023-09-28 22:41:59 +00:00
skc7
bbb95878e9 [LLVM] Update apis incompatible with llvm versions in codegen (#110200)
Opaque pointers support is disabled in llvm 14 and enabled by default from llvm 15 and above.
setOpaquePointers api usage is deprecated from llvm 16. Removed this API.

Update CreateMalloc and CreateFree apis for latest llvm release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110200
Approved by: https://github.com/Skylion007
2023-09-28 21:49:30 +00:00
Peter Bell
be3b16daad [decomp] Fix baddbmm decomposition (#109714)
The decomposition is currently registered without the pw_cast_for_opmath
decorator, due to the ordering of decorators being meaningful.
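
Why decorator order matters for registration can be shown with a toy registry. Everything below is a stand-in written for this illustration: `REGISTRY`, `register`, and `upcast` are hypothetical and only mimic the shape of a registration decorator combined with a type-promotion wrapper.

```python
import functools
from typing import Any, Callable, Dict

REGISTRY: Dict[str, Callable[[Any], Any]] = {}


def register(name: str) -> Callable:
    """Record the function it receives under `name` and return it unchanged."""
    def decorator(fn: Callable) -> Callable:
        REGISTRY[name] = fn
        return fn
    return decorator


def upcast(fn: Callable) -> Callable:
    """Hypothetical stand-in for a promotion wrapper: compute in float."""
    @functools.wraps(fn)
    def wrapper(x: Any) -> Any:
        return fn(float(x))
    return wrapper


@upcast             # applied last: wraps the name, but not the registry entry
@register("wrong")  # applied first: registers the unwrapped function
def half_wrong(x):
    return x / 2 if isinstance(x, float) else x // 2


@register("right")  # applied last: registers the wrapped function
@upcast
def half_right(x):
    return x / 2 if isinstance(x, float) else x // 2


print(REGISTRY["wrong"](3))  # 1
print(REGISTRY["right"](3))  # 1.5
```

Decorators apply bottom-up, so the registration decorator has to sit on top for the wrapped version to be the one that gets registered.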
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109714
Approved by: https://github.com/lezcano
2023-09-28 21:23:44 +00:00
PyTorch MergeBot
e0b035c220 Revert "[core IR] Add lift_fresh, split.Tensor, and unbind decompositions to core ATen decomp table (#110102)"
This reverts commit 22e706f768.

Reverted https://github.com/pytorch/pytorch/pull/110102 on behalf of https://github.com/atalman due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/110102#issuecomment-1739856671))
2023-09-28 19:03:25 +00:00
Yang Chen
aaaa3c1586 Fixed minor issues for bmm/mm decomposition (#109836)
Summary:
* Fixed minor issues for bmm/mm decomposition
* enabled addmm for inductor

Test Plan: ci

Reviewed By: mikekgfb

Differential Revision: D49522332

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109836
Approved by: https://github.com/jansel, https://github.com/mikekgfb
2023-09-28 18:45:01 +00:00
cyy
168f516fae [3/N] Move c10::variant to std::variant (#110141)
This PR moves more c10::variant calls to std::variant

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110141
Approved by: https://github.com/Skylion007
2023-09-28 18:43:55 +00:00
Yanbo Liang
84c5435b29 [1/N] Dynamo skipfiles refactor (#109567)
This is 1/N of the dynamo skipfiles/allowed_functions refactor, the major change in this PR includes:
* Refactor & define the [skipfiles rules](https://github.com/pytorch/pytorch/pull/109567/files#diff-5aa3ce9db729bf0901ea97a5d3cc51924cc8575d9c516c1c8f572a35de92544aR56) and interface
* For every `skipfiles.check`, we return both the check result and the skip/inline reason and log them for debugging.
* We found several latent issues/bugs and incorrect implementations in the codebase, but I'm planning to fix them in follow-up PRs to keep the refactor decoupled from bug fixes.
* More details in the inline comments.
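
For illustration only, a minimal sketch of a check that returns both the decision and the reason. `SkipResult`, `SKIP_RULES`, and this `check` are hypothetical names, not the code in this PR; the real rules are in the diff linked above.

```python
from typing import NamedTuple


class SkipResult(NamedTuple):
    # Hypothetical result type: the decision plus a human-readable reason.
    skipped: bool
    reason: str


# Hypothetical rule table, used only for this sketch.
SKIP_RULES = ("/usr/lib/python3/", "site-packages/some_vendor/")


def check(filename: str) -> SkipResult:
    """Return whether `filename` is skipped, together with the reason."""
    for rule in SKIP_RULES:
        if rule in filename:
            return SkipResult(True, f"matched skip rule {rule!r}")
    return SkipResult(False, "no skip rule matched, inlining")


result = check("/usr/lib/python3/abc.py")
print(result.skipped, result.reason)
```

Returning the reason next to the boolean lets the caller log why a frame was skipped or inlined without re-running the rule matching.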

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109567
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/anijain2305
2023-09-28 18:36:46 +00:00
Jerry Zhang
e3eb1d92d8 [quant][docs] Add documentation for prepare_pt2e, prepare_qat_pt2e and convert_pt2e (#110097)
Summary:
att

Test Plan:
.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110097
Approved by: https://github.com/kimishpatel
2023-09-28 18:24:58 +00:00
Evgeni Burovski
3603f646eb BUG: fix torch._numpy.arange(5, dtype="float32") (#110005)
Make `np.arange` respect an explicitly provided dtype.

Also remove duplicated tests:
- torch_np/test_function_base.py::TestArange is a dupe of
- torch_np/numpy_tests/core/test_multiarray.py::TestArange

Fixes #109975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110005
Approved by: https://github.com/lezcano
2023-09-28 18:21:18 +00:00
ydwu4
5f7eff0adb Replace node.meta source_fn with source_fn_stack (#108595)
A resubmit of https://github.com/pytorch/pytorch/pull/108447. Copy over the descriptions:

This is a follow-up of the discussion in https://github.com/pytorch/pytorch/pull/108356, where we want to replace source_fn with source_fn_stack

Before this PR, for the following example:
```python
backend = EagerAndRecordGraphs()

@torch.compile(backend=backend, fullgraph=True)
def cond_f(pred, pred2, x, y):
    def true_fn(pred2, x, y):
        return x + y

    def false_fn(pred2, x, y):
        def true_fn2(x, y):
            return x.sin() - y.cos()

        def false_fn2(x, y):
            return x.cos() - y.sin()

        return control_flow.cond(pred2, true_fn2, false_fn2, (x, y))

    return control_flow.cond(pred, true_fn, false_fn, (pred2, x, y))
```
The graph captured is shown below:
```python
class GraphModule(torch.nn.Module):
    def forward(self, L_pred_ : torch.Tensor, L_pred2_ : torch.Tensor, L_x_ : torch.Tensor, L_y_ : torch.Tensor):
        l_pred_ = L_pred_
        l_pred2_ = L_pred2_
        l_x_ = L_x_
        l_y_ = L_y_

        cond_true_1 = self.cond_true_1
        cond_false_1 = self.cond_false_1
        cond = torch.ops.higher_order.cond(l_pred_, cond_true_1, cond_false_1, [l_pred2_, l_x_, l_y_]);  l_pred_ = cond_true_1 = cond_false_1 = l_pred2_ = l_x_ = l_y_ = None
        return (cond,)

    class GraphModule(torch.nn.Module):
        def forward(self, l_pred2_, l_x_, l_y_):
            add = l_x_ + l_y_;  l_x_ = l_y_ = None
            return add

    class GraphModule(torch.nn.Module):
        def forward(self, l_pred2_, l_x_, l_y_):
            cond_true_0 = self.cond_true_0
            cond_false_0 = self.cond_false_0
            cond = torch.ops.higher_order.cond(l_pred2_, cond_true_0, cond_false_0, [l_x_, l_y_]);  l_pred2_ = cond_true_0 = cond_false_0 = l_x_ = l_y_ = None
            return cond

        class GraphModule(torch.nn.Module):
            def forward(self, l_x_, l_y_):
                sin = l_x_.sin();  l_x_ = None
                cos = l_y_.cos();  l_y_ = None
                sub = sin - cos;  sin = cos = None
                return sub

        class GraphModule(torch.nn.Module):
            def forward(self, l_x_, l_y_):
                cos = l_x_.cos();  l_x_ = None
                sin = l_y_.sin();  l_y_ = None
                sub = cos - sin;  cos = sin = None
                return sub
```
the source_fn for inner cond, sin, cos will be a (name, target) tuple:
```
('cond', <torch._ops.HigherOrderOperator object at xxx>)
('sin', 'sin')
('cos', 'cos')
('sub', <built-in function sub>)
```

After this PR, the source_fn_stack will be a list of (name, target) tuples. The bottom of the stack is the end of the list.
```
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>)],
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sin', 'sin')],
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cos', 'cos')]
[('cond', <torch._ops.HigherOrderOperator object at xxx>), ('cond', <torch._ops.HigherOrderOperator object at xxx>), ('sub', <built-in function sub>)]
```
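
A toy sketch of how such a stack could be recorded while descending into nested higher-order ops. `StackRecorder` and the string targets are assumptions made for illustration; this is not Dynamo's implementation.

```python
from typing import Any, List, Tuple

Frame = Tuple[str, Any]  # (name, target)


class StackRecorder:
    """Toy recorder: keeps the enclosing frames and snapshots them per node."""

    def __init__(self) -> None:
        self._stack: List[Frame] = []
        self.recorded: List[List[Frame]] = []

    def enter(self, name: str, target: Any) -> None:
        # Called when tracing descends into the body of a higher-order op.
        self._stack.append((name, target))

    def exit(self) -> None:
        self._stack.pop()

    def record(self, name: str, target: Any) -> None:
        # A node's stack is the enclosing frames followed by the node itself,
        # so the node's own entry is the last element of the list.
        self.recorded.append(self._stack + [(name, target)])


rec = StackRecorder()
rec.record("cond", "outer_cond_op")
rec.enter("cond", "outer_cond_op")
rec.record("cond", "inner_cond_op")
rec.enter("cond", "inner_cond_op")
rec.record("sin", "sin")
rec.exit()
rec.exit()
```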

Test Plan:
See added tests in test_higher_order_ops.py and modify existing test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108595
Approved by: https://github.com/angelayi, https://github.com/zou3519
2023-09-28 18:18:36 +00:00
rzou
1d0a8eed5d [generate_opcheck_tests] Enable using same failures_dict for multiple testclasses (#110164)
This PR allows us to use the same failures_dict for multiple test
classes. This is helpful if you have a bunch of small TestCases and
want to centralize all the failures dicts into one big one.
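
As a rough sketch of the idea, one shared dict can be keyed by a qualified test name so that several test classes read from it. The layout, the status strings, and `expected_status` are hypothetical and do not reflect the actual failures_dict schema.

```python
# Hypothetical layout: one shared dict keyed by "<TestClass>.<test_name>".
FAILURES = {
    "TestAdd.test_schema": "xfail",
    "TestMul.test_faketensor": "skip",
}


def expected_status(test_class: str, test_name: str) -> str:
    """Look up the expected status of a test; unlisted tests should succeed."""
    return FAILURES.get(f"{test_class}.{test_name}", "success")


print(expected_status("TestAdd", "test_schema"))
print(expected_status("TestMul", "test_schema"))
```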

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110164
Approved by: https://github.com/williamwen42
2023-09-28 17:56:45 +00:00
vfdev-5
c62be12061 Added batch rules for _upsample_bi*2d_aa and _upsample_bi*2d_aa_backward (#110172)
Description:
- Added batch rules for `_upsample_bi*2d_aa` and `_upsample_bi*2d_aa_backward`
- Added a few more test cases to `sample_inputs_upsample_aten`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110172
Approved by: https://github.com/kshitij12345, https://github.com/zou3519
2023-09-28 17:42:48 +00:00
cyy
7f5fd92372 Reland use std::make_unique after internal changes (#109742)
check internal
follow up of #109780
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109742
Approved by: https://github.com/ezyang
2023-09-28 17:24:08 +00:00
Edwiv
7f5737392d [FSDP] fix: fix for fsdp exec order pre fwd record (#110138)
When sharding_strategy is set to SHARD_GRAD_OP and forward_prefetch=True, self.is_first_iter is always True during a direct validation run (because training=False, so iter+1 is never executed). Additionally, the _pre_forward_order_index of the first handle entering record_pre_forward is 0. As a result, when that handle enters record_pre_forward again, the if condition at line 166 evaluates to False, although it should be True because _pre_forward_order_index has in fact been assigned a value. The first handle is therefore added to handles_pre_forward_order repeatedly, leading to an incorrect prefetching order.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110138
Approved by: https://github.com/awgu
2023-09-28 15:45:05 +00:00
Yukio Siraichi
6f48d872d0 Re-land: Break graph on manual_seed. (#109109)
Re-landing: #108647 (old #107594)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109109
Approved by: https://github.com/lezcano
2023-09-28 15:28:40 +00:00
Bert Maher
5f417fd710 [aot_inductor] Lightweight model runner (#110158)
It's useful to have a simple, lightweight way to run a model that adds
essentially no overhead to calling the model's generated `run_impl` method.
This C API is a super thin wrapper around AOTInductorModel: Create, Run, and
Delete are provided, and do very little work beyond dispatch to the appropriate
helpers.

Note the Create function also provides additional functionality beyond the
Container API; it allows the user to pass in a weight map defined in userland,
which is a requirement for several serving use cases.

Differential Revision: [D49670711](https://our.internmc.facebook.com/intern/diff/D49670711/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110158
Approved by: https://github.com/desertfire, https://github.com/chenyang78
2023-09-28 14:59:41 +00:00
Xu Zhao
ad0ba5e187 [torchbench] Consistent accuracy results with dynamobench (#110189)
Summary:
Use the upstream `torch._dynamo.same` function in accuracy checking and remove the self-hosted version in torchbench.

Now that cmf_10x and ads_dhen_5x can run in deterministic mode, enable deepcopy and deterministic mode for them.

Test Plan:
```
$ buck2 run mode/opt //pytorch/benchmark:run -- cmf_10x -d cuda -t train --accuracy
Running train method from cmf_10x on cuda in eager mode with input batch size 4 and precision tf32.
Accuracy:                            pass
```

```
$ buck2 run mode/opt //pytorch/benchmark:run -- cmf_10x -d cuda -t train --torchdynamo inductor --torchinductor_enable_batch_fusion --torchinductor_enable_split_cat_fx_pass --accuracy
Running train method from cmf_10x on cuda in dynamo inductor mode with input batch size 4 and precision tf32.
Accuracy:                            pass
```

Without this PR, it will print:

```
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_dynamo/utils.py", line 190, in time_wrapper
    r = func(*args, **kwargs)
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/graph.py", line 464, in run
    return super().run(*args)
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/fx/interpreter.py", line 138, in run
    self.env[node] = self.run_node(node)
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/graph.py", line 826, in run_node
    result.realize_hint()
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/ir.py", line 5273, in realize_hint
    and self.is_pointwise_non_scalar_tensor_num_reads_larger_than_one()
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/utils.py", line 343, in wrapper
    setattr(self, key, fn(self))
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/ir.py", line 5332, in is_pointwise_non_scalar_tensor_num_reads_larger_than_one
    (sum(read.index != 0 for read in self.data.get_reads()) > 1)
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/ir.py", line 5332, in <genexpr>
    (sum(read.index != 0 for read in self.data.get_reads()) > 1)
  File "/mnt/xarfuse/uid-234232/9aa53cfe-seed-nspid4026531836_cgpid9238070-ns-4026531840/torch/_inductor/dependencies.py", line 74, in index
    raise NotImplementedError("StarDep does not have an index")
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NotImplementedError: StarDep does not have an index
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True
```

Reviewed By: jackiexu1992, mengluy0125

Differential Revision: D49639733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110189
Approved by: https://github.com/desertfire, https://github.com/jansel
2023-09-28 14:50:57 +00:00
Bin Bao
8e14e76c34 [inductor] Enhance an input type assertion msg (#110176)
Summary: to address https://github.com/pytorch/pytorch/issues/110089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110176
Approved by: https://github.com/angelayi
2023-09-28 13:35:11 +00:00
Bert Maher
eb082ef604 [inductor] Decompose addmm if it's a dot product on cpu (#110010)
Generated code for dot product is often faster (on CPU) than
dispatching to aten, since it avoids op dispatch overhead and allows fusion
with surrounding ops, which in turn avoids allocations.

Differential Revision: [D49595876](https://our.internmc.facebook.com/intern/diff/D49595876/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110010
Approved by: https://github.com/chenyang78, https://github.com/jgong5, https://github.com/mikekgfb
2023-09-28 13:30:14 +00:00
aashishthakur10
ee8983da70 109605 dynamo scalar ndarray pow gen (#109953)
Fixes #109605

Generated code before:
```
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (8, ), (1, ))
    buf0 = empty_strided((), (), device='cpu', dtype=torch.int64)
    cpp_fused_lift_fresh_0(c_void_p(buf0.data_ptr()))
    # Source Nodes: [wrapped_pow], Original ATen: [aten.lift_fresh, aten.pow]
    buf1 = aten.pow(arg0_1, reinterpret_tensor(buf0, (8, ), (0, ), 0))
    del arg0_1
    del buf0
    buf2 = buf1
    assert_size_stride(buf2, (8, ), (1, ))
    del buf1
    return (buf2, )
```

Generated code now:
```
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (8, ), (1, ))
    buf0 = empty_strided((8, ), (1, ), device='cpu', dtype=torch.int64)
    cpp_fused_pow_0(c_void_p(arg0_1.data_ptr()), c_void_p(buf0.data_ptr()))
    del arg0_1
    return (buf0, )
```
@lezcano What would be a good way to add a test for this?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109953
Approved by: https://github.com/lezcano
2023-09-28 13:11:06 +00:00
Avik Chaudhuri
5da5e068f3 deprecate constraints in favor of dynamic_shapes (#110143)
Recently we updated the `export` API to take an experimental `dynamic_shapes` argument that was meant to subsume the existing `constraints` argument.

This PR deprecates `constraints` (with a warning on its use, but without actually removing it). Simultaneously it replaces all uses of `constraints` in docs, examples, and tests with corresponding uses of `dynamic_shapes` (preserving behavior). This exercise fortunately revealed some minor bugs in the implementation which have also been fixed in this PR.
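
A minimal sketch of the "warn but keep working" pattern described above. `export_sketch` is a hypothetical stand-in, and the warning category and message are assumptions rather than what the PR actually emits.

```python
import warnings


def export_sketch(fn, args, *, constraints=None, dynamic_shapes=None):
    """Hypothetical stand-in for an export-like entry point."""
    if constraints is not None:
        # The old argument still works, but the caller is told to migrate.
        warnings.warn(
            "`constraints` is deprecated; use `dynamic_shapes` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
    return {
        "fn": fn,
        "args": args,
        "constraints": constraints,
        "dynamic_shapes": dynamic_shapes,
    }


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    export_sketch(abs, (1,), constraints=[])
print(len(caught))
```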

Some uses of `constraints` still remain, e.g., when `torch._dynamo.export` is called directly. (Meta-internal uses will be updated in a separate diff.)

Differential Revision: D49676049

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110143
Approved by: https://github.com/tugsbayasgalan
2023-09-28 10:26:21 +00:00
Sindi Shkodrani
419ec3b229 Enable pickling model prepared with QAT qconfig (#109288)
Summary:
Resolving error:

AttributeError: Can't pickle local object '_add_module_to_qconfig_obs_ctr.<locals>.get_factory_kwargs_based_on_module_device'

by moving nested function out to the main module
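
The failure mode can be reproduced with the standard library alone. The sketch below uses hypothetical function names and only shows why a function defined inside another function cannot be pickled, while one that can be looked up by name (a builtin here) can.

```python
import pickle


def make_local_factory():
    # A function defined inside another function has "<locals>" in its
    # qualified name, so pickle cannot look it up by name when loading.
    def local_factory():
        return {"device": "cpu"}

    return local_factory


def can_pickle(obj) -> bool:
    """Return True if `obj` can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (AttributeError, pickle.PicklingError):
        return False


print(can_pickle(make_local_factory()))  # nested function
print(can_pickle(len))                   # importable by name
```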

Test Plan: Added test to CI

Reviewed By: andrewor14

Differential Revision: D49187352

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109288
Approved by: https://github.com/andrewor14
2023-09-28 09:51:19 +00:00
angelayi
c71a64ccce [aotinductor] Rename if name is prefixed with integer (#110113)
Fixes https://github.com/pytorch/pytorch/issues/109894.
Since C++ identifiers cannot start with a digit, we add some additional handling in Inductor so that it does not produce constant tensors whose names start with a digit.
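
One way such a rename could look, as a sketch: `sanitize_name` and the `constant_` prefix are hypothetical and not necessarily what Inductor does.

```python
def sanitize_name(name: str, prefix: str = "constant_") -> str:
    """Make `name` usable as a C++ identifier start by prefixing leading digits."""
    if name and name[0].isdigit():
        return prefix + name
    return name


print(sanitize_name("0_weight"))
print(sanitize_name("fc_weight"))
```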

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110113
Approved by: https://github.com/desertfire
2023-09-28 07:26:28 +00:00
Brian
e20c35a53b Allow public access for imports (#108914)
Fixes #108776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/108914
Approved by: https://github.com/wanchaol
2023-09-28 06:05:59 +00:00
Jez Ng
fc1fcc4d17 Enable typechecking for _inductor/fx_passes/group_batch_fusion.py (#110111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110111
Approved by: https://github.com/eellison, https://github.com/Skylion007
ghstack dependencies: #110109
2023-09-28 04:53:09 +00:00
Jez Ng
3e7f23e04f [inductor] Actually enable typing for sizevars.py and joint_graph.py (#110109)
The commit message of #107862 says it enabled mypy checking for
sizevars.py, but it seems that it neglected to update .lintrunner.toml.

New type errors appear to have crept in since then, so I've fixed them
accordingly.

A similar mistake happened with #109955 for joint_graph.py, though that
one is more recent and so hasn't had any new type errors to fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110109
Approved by: https://github.com/Skylion007
2023-09-28 04:53:09 +00:00
cyy
a81d083b1c [Reland] Add -Wdeprecated and related fixes (#110019)
This is a reland of PRs https://github.com/pytorch/pytorch/pull/108626 and #109564. We fixed the iOS build failure by changing
```
((CHECK) ? (EXPR) : ([] { assert(!#CHECK); }(), (EXPR)))
```
to
```
((CHECK) ? (EXPR) : ([] { assert(false); }(), (EXPR)))
```
in TR2_OPTIONAL_ASSERTED_EXPRESSION, since the former syntax was invalid on Apple Clang. We apply this simple fix in the hope that c10::optional will be replaced by std::optional soon.
We also enabled -Wdeprecated on c10.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110019
Approved by: https://github.com/clee2000
2023-09-28 03:34:29 +00:00
Sherlock Huang
7f2b51c668 [AOTInductor] ProxyExecutor supports custom op with tuple output (#110140)
Summary:
Extend ProxyExecutor to support custom ops with tuple outputs.

Generated wrapper code for `out3, out4 = torch.ops.fb.fn_with_tuple_output(out2, 1)`

```
    AtenTensorHandle buf5_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf5_handle));
    RAIIAtenTensorHandle buf5(buf5_handle);
    AtenTensorHandle buf6_handle;  // output buffer
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_new_uninitialized_tensor(&buf6_handle));
    RAIIAtenTensorHandle buf6(buf6_handle);
    AtenTensorHandle tensor_args_var_3[] = {buf3.get(), buf5.get(), buf6.get()};
    int64_t int_args_var_4[] = {1};
    aoti_torch_proxy_executor_call_function(proxy_executor, 1, 1, int_args_var_4, 3, tensor_args_var_3);
```

Test Plan: Test

Differential Revision: D49673994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110140
Approved by: https://github.com/chenyang78
2023-09-28 02:50:39 +00:00
PyTorch MergeBot
75462fd870 Revert "[1/N] Dynamo skipfiles refactor (#109567)"
This reverts commit f8e0ebec8c.

Reverted https://github.com/pytorch/pytorch/pull/109567 on behalf of https://github.com/huydhn due to Many jobs are failing in trunk after this with FILENAME_ALLOWLIST is not defined error f8e0ebec8c. This looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/109567#issuecomment-1738344950))
2023-09-28 02:22:22 +00:00
Matthew Hoffman
68b0db1274 Define the public API for torch.distributed.fsdp (#109922)
Related: https://github.com/pytorch/pytorch/wiki/Public-API-definition-and-documentation
Related: https://github.com/microsoft/pylance-release/issues/2953

This fixes pylance issues for these classes:

```
"FullyShardedDataParallel" is not exported from module "torch.distributed.fsdp"
```

These classes all have public docs:

* [`BackwardPrefetch`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.BackwardPrefetch)
* [`CPUOffload`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.CPUOffload)
* [`FullyShardedDataParallel`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel)
* [`MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision)
* [`ShardingStrategy`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.ShardingStrategy)

And it seems like all the newly added classes will have docs once they are released.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109922
Approved by: https://github.com/wanchaol
2023-09-28 02:15:58 +00:00
Joel Schlosser
f5a23ca78d Make unbind() overrideable for NT subclass (#109122)
Goal: avoid making unbind composite implicit so we can override it within `__torch_dispatch__()` for the NT subclass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109122
Approved by: https://github.com/cpuhrsch, https://github.com/soulitzer
2023-09-28 01:26:22 +00:00
Yanbo Liang
f8e0ebec8c [1/N] Dynamo skipfiles refactor (#109567)
This is 1/N of the dynamo skipfiles/allowed_functions refactor, the major change in this PR includes:
* Refactor & define the [skipfiles rules](https://github.com/pytorch/pytorch/pull/109567/files#diff-5aa3ce9db729bf0901ea97a5d3cc51924cc8575d9c516c1c8f572a35de92544aR56) and interface
* For every `skipfiles.check`, we return both the check result and the skip/inline reason and log them for debugging.
* We found several latent issues/bugs and incorrect implementations in the codebase, but I'm planning to fix them in follow-up PRs to keep the refactor decoupled from bug fixes.
* More details in the inline comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109567
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/anijain2305
2023-09-28 01:21:59 +00:00
SS-JIA
22e706f768 [core IR] Add lift_fresh, split.Tensor, and unbind decompositions to core ATen decomp table (#110102)
## Context

Add existing decomps for `lift_fresh`, `split.Tensor`, and `unbind` to the core ATen decomposition table. Do not use them in inductor, since Inductor currently lowers these directly.

One note though is that `lift_fresh`'s decomposition has a note saying it's not correct under autograd. However, my understanding is that these decompositions are registered to the `"post_autograd"` decomposition table, meaning autograd wouldn't be a factor. Would like some confirmation that this premise is correct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110102
Approved by: https://github.com/jansel
2023-09-28 01:21:45 +00:00
Mu-Chu Lee
840bb650f8 [AOTInductor] Update regex rule for symbol (#110184)
Summary:
Update the regex rule to also match the `_` character.

Test Plan:
Included in commit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110184
Approved by: https://github.com/desertfire
2023-09-28 01:13:18 +00:00
CaoE
9399e0b1ff add fp16 support for gemm (#99498)
### Testing

Native matmul vs. mkldnn matmul  on SPR (with avx512_fp16 support)

single core:

Input | Naïve impl   / ms | oneDNN /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 2010.387 | 64.700 | 31.072
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 4027.116 | 107.780 | 37.364
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 28685868.488 | 90663.008 | 316.401

56 cores:
Input | Naïve impl   / ms | oneDNN /   ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 5.091 | 0.24 | 211.30
M: 128, N: 128, K: 128, trans_a: False, trans_b: True | 5.224 | 0.23 | 220.09
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 10.006 | 0.30 | 330.31
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 29435.372 | 1.770 | 1662.80
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 31464.961 | 1.728 |  18204.76
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 115035.849  | 7.990 | 14396.90
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 122981.023 |  7.725 | 15918.34
Batch: 768, M: 128, N: 64, K: 128  | 2032.523 | 0.705 | 2882.23

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99498
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-09-28 01:03:50 +00:00
Peter Bell
d796518485 [refs] Fix size check from #108360 (#109083)
PR #108360 uses the same default `last_dim_size` formula from complex-to-real (C2R) transforms for
complex-to-complex (C2C) and real-to-complex (R2C). However, this is not correct because for C2R
the input is only half the size of the full tensor, which is not the case for C2C and R2C.

This error is mostly benign since `last_dim_size` was only used for the `>= 1` condition which is
almost always met anyway.
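
The size relationship can be written out as a small sketch. `default_last_dim_size` here is a hypothetical helper, not the function in the refs; it follows the usual FFT convention that a C2R transform with m input elements defaults to 2 * (m - 1) outputs.

```python
def default_last_dim_size(kind: str, input_last_dim: int) -> int:
    """Default size of the full signal along the transformed dimension."""
    if kind == "c2r":
        # The complex input holds only the non-redundant half of the spectrum.
        return 2 * (input_last_dim - 1)
    if kind in ("c2c", "r2c"):
        # The input already spans the full signal length.
        return input_last_dim
    raise ValueError(f"unknown transform kind: {kind!r}")


print(default_last_dim_size("c2r", 5))
print(default_last_dim_size("c2c", 8))
```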

In this PR it is now passed as the argument to `_apply_norm`, which makes it load-bearing for correctness,
so it is now thoroughly tested.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109083
Approved by: https://github.com/lezcano
2023-09-27 23:59:29 +00:00
Mark Saroufim
40b83d98de fix bugs in export docstrings (#110169)
First error

```
Traceback (most recent call last):
  File "/home/ubuntu/exporty.py", line 8, in <module>
    ep = torch.export.export(MyModule(), torch.randn(5))
  File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/export/__init__.py", line 509, in export
    return export(f, args, kwargs, constraints)
  File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/_export/__init__.py", line 314, in export
    raise UserError(UserErrorType.INVALID_INPUT,
torch._dynamo.exc.UserError: Expecting `args` to be a tuple of example positional inputs, got <class 'torch.Tensor'>
```

Second error

```
(sam) ubuntu@ip-172-31-9-217:~$ python exporty.py
Traceback (most recent call last):
  File "/home/ubuntu/exporty.py", line 13, in <module>
    torch.export.save(ep, 'exported_program.pt2', extra_files=extra_files)
  File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/export/__init__.py", line 566, in save
    save(ep, f, extra_files=extra_files, opset_version=opset_version)
  File "/opt/conda/envs/sam/lib/python3.10/site-packages/torch/_export/__init__.py", line 595, in save
    encoded_content = content.encode('utf-8')
AttributeError: 'bytes' object has no attribute 'encode'. Did you mean: 'decode'?
```
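
Both tracebacks come down to an argument of the wrong type: a bare tensor where a tuple of inputs is expected, and `bytes` where `str` is expected. The helpers below are hypothetical and only illustrate those two type expectations; they are not part of `torch.export` and are not the change made in this PR.

```python
def check_example_args(args):
    """Require example inputs as a tuple, e.g. (x,) rather than x."""
    if not isinstance(args, tuple):
        raise TypeError(
            f"expected a tuple of example positional inputs, got {type(args)}"
        )
    return args


def to_utf8_bytes(content):
    """Encode str content; pass bytes through instead of calling .encode on it."""
    if isinstance(content, str):
        return content.encode("utf-8")
    if isinstance(content, (bytes, bytearray)):
        return bytes(content)
    raise TypeError(f"expected str or bytes, got {type(content)}")


print(check_example_args((1.0,)))
print(to_utf8_bytes("bar"), to_utf8_bytes(b"bar"))
```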
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110169
Approved by: https://github.com/angelayi
2023-09-27 22:56:42 +00:00
Tugsbayasgalan Manlaibaatar
bf7307adf8 Support inference_mode decorator (#109274)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109274
Approved by: https://github.com/williamwen42
2023-09-27 22:21:42 +00:00
Michael Voznesensky
2ff9d1fda3 Add size to constant - type dispatch through BaseListVariable.cls_for (#110166)
Differential Revision: D49689895

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110166
Approved by: https://github.com/anijain2305
2023-09-27 21:44:16 +00:00
Mu-Chu Lee
7782108792 [AOTInductor] Fix freeze for AOTInductor (#110055)
Summary:
Add test for freeze graph in AOTInductor.
Remove unused code path.

Test Plan:
Included in commit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110055
Approved by: https://github.com/angelayi
2023-09-27 21:21:47 +00:00
Animesh Jain
213badf632 [dynamo][guards-log] Add debug msg for nn_module_guards only when log is enabled (#110167)
I did not do any benchmarks, but there could be a small overhead of creating the debug_msg. Adding debug_msg only when guards log is enabled.
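
The general pattern, sketched with the standard `logging` module. The logger name, `build_debug_msg`, and `make_guard` are hypothetical; Dynamo's own logging setup is not shown here.

```python
import logging

log = logging.getLogger("guards_sketch")


def build_debug_msg(module_name: str) -> str:
    # Stand-in for a message that is comparatively costly to construct.
    return f"nn_module_guard on {module_name}"


def make_guard(module_name: str) -> dict:
    debug_msg = None
    # Only pay for building the message when it would actually be logged.
    if log.isEnabledFor(logging.DEBUG):
        debug_msg = build_debug_msg(module_name)
    return {"module": module_name, "debug_msg": debug_msg}


log.setLevel(logging.WARNING)
print(make_guard("model.layer1"))
```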

Pull Request resolved: https://github.com/pytorch/pytorch/pull/110167
Approved by: https://github.com/ezyang
2023-09-27 21:11:44 +00:00
Jon Chuang
6aae636f69 chore(inductor): Simplify will_fusion_create_cycle and cleanup to node.ancestors (#109976)
recursive_predecessors == ancestors so rename.

Improve comments

Simplify `will_fusion_create_cycle` - make it easier to read and add detailed comments.
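
For readers unfamiliar with the check, here is a simplified sketch of the underlying idea: fusing two nodes creates a cycle when some third node lies on a path between them. The graph encoding and both function names are assumptions; this is not Inductor's implementation and omits its shortcut.

```python
from typing import Dict, Set


def compute_ancestors(preds: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each node to all of its transitive predecessors (graph must be a DAG)."""
    ancestors: Dict[str, Set[str]] = {}

    def visit(node: str) -> Set[str]:
        if node not in ancestors:
            result: Set[str] = set()
            for pred in preds.get(node, ()):
                result.add(pred)
                result |= visit(pred)
            ancestors[node] = result
        return ancestors[node]

    for node in preds:
        visit(node)
    return ancestors


def fusion_creates_cycle(a: str, b: str, ancestors: Dict[str, Set[str]]) -> bool:
    """True if some other node c sits on a path a -> c -> b or b -> c -> a."""
    for c, anc_c in ancestors.items():
        if c in (a, b):
            continue
        if (a in anc_c and c in ancestors[b]) or (b in anc_c and c in ancestors[a]):
            return True
    return False


# Direct predecessors of each node: a -> c -> b, a -> b, a -> d.
preds = {"a": set(), "c": {"a"}, "b": {"a", "c"}, "d": {"a"}}
anc = compute_ancestors(preds)
print(fusion_creates_cycle("a", "b", anc), fusion_creates_cycle("a", "c", anc))
```

Here fusing `a` and `b` would trap `c` between the fused node's input and output, while fusing `a` and `c` is safe.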

Diagram to illustrate clarification of shortcut.
![Inductor Deep Dive](https://github.com/pytorch/pytorch/assets/9093549/7a30e088-8a33-4a9c-a8a7-81199cd086e2)

CC: @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109976
Approved by: https://github.com/jansel
2023-09-27 20:48:53 +00:00