Commit Graph

1383 Commits

yanbing-j
dc40b6d043 Upgrade oneDNN to v2.7.2 (#90051)
This PR is to upgrade oneDNN to v2.7.2.

### oneDNN v2.7.1 & 2.7.2 changes:
Fixes #89104
Updated ITT API version to 3.23.0

### Performance Benchmark
TorchBench tests were run on ICX with 40 cores.
Intel OpenMP & tcmalloc were preloaded
![image](https://user-images.githubusercontent.com/61222868/205240855-04e2d50f-8b3a-4097-9038-fdd0c0fc93b9.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90051
Approved by: https://github.com/XiaobingSuper, https://github.com/jgong5
2022-12-08 09:41:02 +00:00
Facebook Community Bot
3ef4fc2012 Automated submodule update: FBGEMM (#74729)
This is an automated pull request to update the first-party submodule for [pytorch/FBGEMM](https://github.com/pytorch/FBGEMM).

New submodule commit: f99e161663

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74729
Approved by: https://github.com/malfet
2022-12-07 22:36:35 +00:00
PyTorch MergeBot
0d8e53dfe7 Revert "[Composable API] replicate: change to per module call, remove mark_root_module() (#89222)"
This reverts commit 65a0dcffd8.

Reverted https://github.com/pytorch/pytorch/pull/89222 on behalf of https://github.com/malfet due to Included unintended submodule updates
2022-12-06 03:26:28 +00:00
Charlie Yan
65a0dcffd8 [Composable API] replicate: change to per module call, remove mark_root_module() (#89222)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89222
Approved by: https://github.com/zhaojuanmao
2022-12-05 17:54:55 +00:00
Nikita Shulga
f2cf1b0f5e Revert submodule updates introduced by #89157 (#89449)
Reverts updates that were introduced by https://github.com/pytorch/pytorch/pull/89157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89449
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/clee2000
2022-11-22 05:48:43 +00:00
Taylor Robie
cf9476554f update kineto pinned commit (#89435)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89435
Approved by: https://github.com/malfet
2022-11-21 17:32:29 +00:00
yanbing-j
a80e5e7813 Update ideep for future performance improvement (#87966)
**Summary**
The update includes API changes and optimizations to reduce framework overhead, which will benefit all mkldnn (onednn) ops in JIT mode, the inductor CPU backend, etc. These benefits will materialize once future PRs switch to the new ideep API.
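
The ideep changes themselves live inside the submodule; for context, here is a minimal sketch of the mkldnn (oneDNN) op path they serve, using the public `torch.utils.mkldnn` helpers (which predate, and are unrelated to, the new ideep API):

```python
import torch
from torch.utils import mkldnn as mkldnn_utils

# Convert module weights to oneDNN's opaque (blocked) layout so that
# convolution dispatches through ideep/oneDNN kernels.
model = torch.nn.Conv2d(3, 16, kernel_size=3).eval()
x = torch.randn(1, 3, 224, 224)

mkldnn_model = mkldnn_utils.to_mkldnn(model)
y = mkldnn_model(x.to_mkldnn()).to_dense()
print(y.shape)  # torch.Size([1, 16, 222, 222])
```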

**Test plan**
For correctness, all UTs that call mkldnn ops were run, including test_ops.py, test_mkldnn*.py, test_quantization.py, etc.
For performance, TorchBench was run and no regression was found. Results are shown below.
- Intel(R) Xeon(R) Ice Lake with 40 cores
- Multi-instance runs
- tcmalloc & Intel OMP preloaded

![image](https://user-images.githubusercontent.com/12522207/201631004-bb77468d-953b-4757-a001-94d44615b5f6.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87966
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper
2022-11-21 09:52:36 +00:00
Nikita Shulga
ea58955dda Move bazel to c++17 (#89297)
Splitting out various smaller pieces from https://github.com/pytorch/pytorch/pull/85969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89297
Approved by: https://github.com/huydhn
2022-11-19 01:13:08 +00:00
Zain Rizvi
ab75982d3a Always retry curl downloads (#89157)
Modify our curl commands so that they always retry downloads.

By default, curl only retries what it considers to be "transient" errors, based on the server's response. However, curl's notion of what counts as transient is very conservative. Adding the `--retry-all-errors` flag makes our curl commands always retry.

In particular, I'm hoping this mitigates errors where curl fails with the below error ([logs](https://github.com/pytorch/pytorch/actions/runs/3468758110/jobs/5794939941))
`curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to ossci-linux.s3.amazonaws.com:443`

Some of the modified downloads didn't even have retries, so I added them in.

More details: https://everything.curl.dev/usingcurl/downloads/retry
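
For illustration, a minimal sketch of the flag combination described, wrapped in Python with a placeholder URL (the real CI scripts differ):

```python
import subprocess

# Retry up to 3 times with backoff; --retry-all-errors also retries
# errors curl would not normally consider transient.
subprocess.run(
    [
        "curl",
        "--retry", "3",
        "--retry-all-errors",
        "-fsSL",
        "-o", "artifact.tar.gz",
        "https://example.com/artifact.tar.gz",  # placeholder URL
    ],
    check=True,
)
```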
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89157
Approved by: https://github.com/kit1980, https://github.com/malfet
2022-11-18 07:03:24 +00:00
Kenichi Maehashi
e2f0648750 Add an option to include actual license terms to the output (#85624)
When building products using PyTorch, it is often required to display license terms for all dependencies.
The feature itself was implemented in #81500, but there was no option to enable it.
This PR implements the option.

cc/ @mattip @rgommers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85624
Approved by: https://github.com/rgommers, https://github.com/seemethere
2022-11-16 05:07:53 +00:00
Nikita Shulga
6be426ca1a Update gloo submodule (#88530)
Also, add an explicit cudart dependency to `torch_cuda` if Kineto is used with GPU support (it used to be somehow inherited from a wrong `gloo` setup)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88530
Approved by: https://github.com/osalpekar
2022-11-09 01:04:32 +00:00
Aaron Gokaslan
5fb9c113ae Update pybind11 to v2.10.1 (#88332)
I am one of the maintainers of pybind11, and a frequent PyTorch user. We added quite a lot of bugfixes and performance improvements in 2.10.1 (see the changelog for full details) and I wanted to upstream them to PyTorch.

Our releases are tested throughout Google's codebase, including on their global builds of PyTorch, so there should be no surprises.

The main new feature is opt-in Eigen Tensor to NumPy casters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88332
Approved by: https://github.com/soumith
2022-11-03 02:53:26 +00:00
Minh Nguyen
bd4c4537dc aten cpu and xnnpack to be compatible with arvr mode build (#87125)
Summary:
When building the 3D photo SDK generator package in arvr/mode/mac and arvr/mode/mac-arm modes, we hit several issues with the aten cpu and xnnpack libraries.

The reason is that those packages use platform-* properties (platform-deps, platform-srcs, ...), which are not compatible with arvr modes.

This diff fixes those issues by using `select` for non-platform properties when is_arvr_mode() is true, while keeping the platform ones for non-arvr modes, as sketched below.
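
A minimal sketch of that pattern, with hypothetical target and file names (the `is_arvr_mode` load path is assumed as well):

```python
# BUCK (Starlark) -- all names here are hypothetical
load("//tools/build_defs:arvr_utils.bzl", "is_arvr_mode")  # assumed load path

cxx_library(
    name = "xnnpack_wrapper",
    # In arvr modes, choose sources via select(); otherwise keep the
    # platform-srcs property that arvr modes cannot handle.
    srcs = select({
        "ovr_config//os:macos": ["mac_impl.cpp"],
        "DEFAULT": ["generic_impl.cpp"],
    }) if is_arvr_mode() else ["generic_impl.cpp"],
    platform_srcs = [] if is_arvr_mode() else [
        ("^macosx.*$", ["mac_impl.cpp"]),
    ],
)
```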

Test Plan:
```
buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac-arm/dev
buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac-arm/opt

buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac/dev
buck build //arvr/projects/compphoto/photo3d_sdk/unity/plugin:generator_plugin_shared arvr/mode/mac/opt
```

and sandcastle builds

Differential Revision: D40028669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87125
Approved by: https://github.com/kimishpatel
2022-10-25 22:52:52 +00:00
Christian Puhrsch
f6c6048b10 Use CUTLASS GEMM for NT bmm (#85894)
Copy of https://github.com/pytorch/pytorch/pull/85710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85894
Approved by: https://github.com/drisspg
2022-10-18 23:11:47 +00:00
Jiang, Yanbing
c56be31d2e Upgrade oneDNN to v2.7 (#87061)
This PR is to upgrade oneDNN to v2.7.

### oneDNN v2.7 changes:

**Performance Optimizations**
- Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids).
- Introduced performance optimizations for [bf16 floating point math mode](http://oneapi-src.github.io/oneDNN/group_dnnl_api_mathmode.html) on Intel Xeon Scalable processors (code name Sapphire Rapids). The bf16 math mode allows oneDNN to use bf16 arithmetic and Intel AMX instructions in computations on fp32 data.

Please go to https://github.com/oneapi-src/oneDNN/releases/tag/v2.7 for more detailed changes.
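
For context, oneDNN exposes the bf16 math mode mentioned above through its fpmath-mode controls, e.g. the `ONEDNN_DEFAULT_FPMATH_MODE` environment variable; whether a particular PyTorch build's bundled oneDNN picks it up is an assumption here, so treat this as a sketch:

```python
import os

# Must be set before oneDNN is loaded, i.e. before importing torch.
# BF16 fpmath mode lets fp32 primitives use bf16 arithmetic / Intel AMX.
os.environ["ONEDNN_DEFAULT_FPMATH_MODE"] = "BF16"

import torch  # noqa: E402

x = torch.randn(128, 128)
w = torch.randn(128, 128)
y = x @ w  # fp32 matmul, now eligible for bf16 math mode on supported CPUs
```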

### oneDNN v2.6.1 & 2.6.2 changes:

**Functionality**

- Updated ITT API to 3.22.5
- Fixed correctness issue in fp32 convolution implementation for cases with large spatial size (https://github.com/pytorch/pytorch/issues/84488)

### Performance Benchmark
TorchBench tests were run on ICX with 40 cores.
Intel OpenMP & tcmalloc were preloaded
![image](https://user-images.githubusercontent.com/61222868/196121957-656faebc-9f4a-49f0-9ef0-0784416c3a47.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87061
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper, https://github.com/weiwangmeta
2022-10-18 19:07:58 +00:00
sanchitintel
974ad8fa6c Add BFloat16 dtype support for oneDNN Graph JIT fuser (#85591)
## BFloat16 dtype support for faster inference with TorchScript using oneDNN Graph

Intel Xeon Cooper Lake platform & beyond support the `AVX512_BF16` ISA, which is essentially native BFloat16 support.
oneDNN Graph delivers high inference performance with BFloat16 on such machines.

While oneDNN Graph can still be used with BFloat16 on older machines that lack the `avx512_bf16` ISA but support the `avx512bw`, `avx512vl` & `avx512dq` ISAs, BF16 performance on such machines will be significantly poorer (possibly even worse than Float32), as they lack native BF16 support.

Currently, [AMP support for eager mode & JIT mode is divergent in PyTorch](https://github.com/pytorch/pytorch/issues/75956).
So, to use oneDNN Graph with BFloat16, leverage eager-mode AMP and turn off AMP for JIT mode via `torch._C._jit_set_autocast_mode(False)` in Python code, so as to avoid conflicts.

Please use the following environment variable to view JIT logs -
`PYTORCH_JIT_LOG_LEVEL=">>graph_helper:>>graph_fuser:>>kernel:>>interface"`

## Changes being made in this PR
1. This PR does NOT change the `oneDNN` commit or the `ideep` files. While the `ideep` commit is being updated, only files pertaining to oneDNN Graph are being updated. oneDNN Graph is being upgraded to version 0.5.2 (alpha patch release 2).
To put things into perspective, `ideep` is a git submodule of PyTorch. `oneDNN Graph` is a git submodule of `ideep` (`ideep/mkl-dnn`), and oneDNN is a git submodule of oneDNN Graph (`ideep/mkl-dnn/third_party/oneDNN`).
2. Unit-tests are being updated. We now use the [existing dtypes decorator](https://github.com/pytorch/pytorch/blob/master/torch/testing/_internal/common_device_type.py#L123-L131).
3. Suggestions made by @eellison in the [FP32 PR](https://github.com/pytorch/pytorch/pull/68111#pullrequestreview-896719477) are being incorporated/addressed -

| Action-item | Status |
| :---                                             |          ---: |
|checkInputCompatibility follow up | Fixed |
|the mayConvertScalarInputToTensor logic we can consider | Added type promotion code |
|fix up fixConvOptionalBias| The current approach seems correct |
|Use opinfo tests| using dtypes decorator. Will use `OpInfo` in a subsequent PR, if that'd be possible. Should we create a list of ops from opDB that are supported by oneDNN Graph, and add it to `common_methods_invocations.py`? |
|inferDevice torch_check call | not necessary now, perhaps, as only CPU is supported, for now? We'd add it by the beta release of oneDNN Graph, though, so that by then, users might be able to use other fusers with oneDNN Graph (NNC/TensorExpr are already compatible with the oneDNN Graph fuser). We can still add it, if you'd insist. |
|not checking shapes of input mkldnn tensor to llga guard | Those checks should not be present because oneDNN Graph may use blocked or channels-last layout, so those strides would be different. They're only skipped if an LLGA subgraph's output is input to another LLGA subgraph, which enables LLGA to choose an optimal layout between them. |
|fix test failures with respect to unsupported inputs | We'll address them with the upcoming release of oneDNN Graph beta version|

4. More PyTorch ops are being mapped to oneDNN Graph.

## Example of using oneDNN Graph with BFloat16

```python
# Assuming we have a model of the name 'model'

example_input = torch.rand(1, 3, 224, 224)

# enable oneDNN Graph
torch.jit.enable_onednn_fusion(True)
# Disable AMP for JIT
torch._C._jit_set_autocast_mode(False)
with torch.no_grad(), torch.cpu.amp.autocast():
    model = torch.jit.trace(model, (example_input))
    model = torch.jit.freeze(model)
    # 2 warm-ups (2 for tracing/scripting with an example, 3 without an example)
    model(example_input)
    model(example_input)

    # speedup would be observed in subsequent runs.
    model(example_input)
```

## TorchBench based Benchmarks
**URL:** https://github.com/sanchitintel/benchmark/tree/onednn_graph_benchmark (instructions present at URL).
**Batch-size(s):** TorchBench-default for each model
**Baseline :** PyTorch JIT OFI FP32
**Machine:** Intel(R) Xeon(R) Platinum 8371HC (Cooper Lake)
**Sockets used**: 1
**Number of cores on one socket**: 26
Intel OpenMP & tcmalloc were preloaded

#### Benchmark results with single thread
| name                                             | latency of PyTorch JIT OFI FP32 (s) |   Latency of oneDNN Graph BF16 (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| test_eval[alexnet-cpu-jit]                       |      1.063851 |        0.509820 |     -52.1% |
| test_eval[mnasnet1_0-cpu-jit]                    |      0.218435 |        0.107100 |     -51.0% |
| test_eval[mobilenet_v2-cpu-jit]                  |      0.114467 |        0.058359 |     -49.0% |
| test_eval[mobilenet_v3_large-cpu-jit]            |      0.233873 |        0.117614 |     -49.7% |
| test_eval[resnet18-cpu-jit]                      |      0.160584 |        0.075854 |     -52.8% |
| test_eval[resnet50-cpu-jit]                      |      1.652846 |        0.713373 |     -56.8% |
| test_eval[resnext50_32x4d-cpu-jit]               |      0.471174 |        0.209431 |     -55.6% |
|test_eval[shufflenet_v2_x1_0-cpu-jit] | 0.310306 | 0.167090 | -46.2% |
| test_eval[squeezenet1_1-cpu-jit]                 |      0.161247 |        0.045684 |     -71.7% |
| test_eval[timm_efficientnet-cpu-jit]             |      1.643772 |        0.800099 |     -51.3% |
| test_eval[timm_regnet-cpu-jit]                   |      5.732272 |        2.333417 |     -59.3% |
| test_eval[timm_resnest-cpu-jit]                  |      1.366464 |        0.715252 |     -47.7% |
| test_eval[timm_vision_transformer-cpu-jit]       |      0.508521 |        0.271598 |     -46.6% |
| test_eval[timm_vovnet-cpu-jit]                   |      2.756692 |        1.125033 |     -59.2% |
| test_eval[vgg16-cpu-jit]                         |      0.711533 |        0.312344 |     -56.1% |

#### Benchmark results with 26 threads:
| name                                             | latency of PyTorch JIT OFI FP32 (s) |   Latency of oneDNN Graph BF16 (s) |   % change |
| :---                                             |          ---: |            ---: |       ---: |
| test_eval[alexnet-cpu-jit]                       |      0.062871 |        0.034198 |     -45.6% |
| test_eval[mnasnet1_0-cpu-jit]                    |      0.022490 |        0.008172 |     -63.7% |
| test_eval[mobilenet_v2-cpu-jit]                  |      0.012730 |        0.005866 |     -53.9% |
| test_eval[mobilenet_v3_large-cpu-jit]            |      0.025948 |        0.010346 |     -60.1% |
| test_eval[resnet18-cpu-jit]                      |      0.011194 |        0.005726 |     -48.9% |
| test_eval[resnet50-cpu-jit]                      |      0.124662 |        0.045599 |     -63.4% |
| test_eval[resnext50_32x4d-cpu-jit]               |      0.034737 |        0.015214 |     -56.2% |
|test_eval[shufflenet_v2_x1_0-cpu-jit] | 0.028820 | 0.012517 | -56.6% |
| test_eval[squeezenet1_1-cpu-jit]                 |      0.012557 |        0.003876 |     -69.1% |
| test_eval[timm_efficientnet-cpu-jit]             |      0.203177 |        0.051879 |     -74.5% |
| test_eval[timm_regnet-cpu-jit]                   |      0.452050 |        0.151113 |     -66.6% |
| test_eval[timm_resnest-cpu-jit]                  |      0.117072 |        0.052848 |     -54.9% |
| test_eval[timm_vision_transformer-cpu-jit]       |      0.046048 |        0.023275 |     -49.5% |
| test_eval[timm_vovnet-cpu-jit]                   |      0.213187 |        0.077482 |     -63.7% |
| test_eval[vgg16-cpu-jit]                         |      0.044726 |        0.021998 |     -50.8% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85591
Approved by: https://github.com/jgong5, https://github.com/frank-wei, https://github.com/chunyuan-w
2022-10-13 20:36:59 +00:00
PyTorch MergeBot
d169f950da Revert "Use CUTLASS GEMM for NT bmm [OSS-only] (#85894)"
This reverts commit ef58a132f2.

Reverted https://github.com/pytorch/pytorch/pull/85894 on behalf of https://github.com/DanilBaibak due to Break internal build
2022-10-13 15:28:09 +00:00
Christian Puhrsch
ef58a132f2 Use CUTLASS GEMM for NT bmm [OSS-only] (#85894)
OSS-only copy of https://github.com/pytorch/pytorch/pull/85710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85894
Approved by: https://github.com/drisspg
2022-10-12 20:03:28 +00:00
Nikita Shulga
09364f4298 Compile C10 with Wshadow (#86666)
This should prevent further regressions like https://github.com/pytorch/pytorch/pull/86646
Update `fmt` to `7.1.0` to fix variable shadowing in that library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86666
Approved by: https://github.com/seemethere
2022-10-11 22:39:58 +00:00
Jianyu Huang
577070ff96 update fbgemm commit ID in PyTorch (#86577)
Summary:
Update after https://github.com/pytorch/FBGEMM/pull/1388.

Previous issue: D40216348

Test Plan: CI

Differential Revision: D40219252

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86577
Approved by: https://github.com/malfet
2022-10-11 02:15:53 +00:00
Nikita Shulga
6a1e3f2f37 Update fbgemm submodule (#86054)
Reland of 481def752c
Fixes https://github.com/pytorch/pytorch/issues/85956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86054
Approved by: https://github.com/xuzhao9
2022-10-03 05:51:22 +00:00
Nikita Shulga
b9b24c31fd [MPS] Fix non-contig to contig tensor copy (#86056)
This handles a rare case where an MPS tensor is constructed from a non-contiguous CPU tensor.
Fixes https://github.com/pytorch/pytorch/issues/85967
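
A minimal sketch of the case being fixed (a transposed CPU tensor is a simple non-contiguous example):

```python
import torch

cpu_t = torch.randn(3, 4).t()  # transpose makes the tensor non-contiguous
assert not cpu_t.is_contiguous()

if torch.backends.mps.is_available():
    mps_t = cpu_t.to("mps")                 # copy non-contiguous CPU -> MPS
    assert torch.equal(mps_t.cpu(), cpu_t)  # values must survive the copy
```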

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86056
Approved by: https://github.com/janeyx99
2022-10-02 20:13:05 +00:00
Nikita Shulga
481def752c Update fbgemm submodule (#86054)
Fixes https://github.com/pytorch/pytorch/issues/85956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/86054
Approved by: https://github.com/xuzhao9
2022-10-02 15:05:34 +00:00
Peter Bell
9a81da7ad1 Update NCCL to current master and remove patch step (#85367)
The patch from #84245 has been upstreamed into NCCL, so the patch step is no longer required.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/85367
Approved by: https://github.com/ezyang
2022-09-21 19:23:49 +00:00
PyTorch MergeBot
35088f283e Revert "Python stack tracing OD flow (part 1) (#84362)"
This reverts commit 1f4f05e59c.

Reverted https://github.com/pytorch/pytorch/pull/84362 on behalf of https://github.com/malfet due to Broke CUDA-10.2 tests, see 1f4f05e59c
2022-09-20 03:42:43 +00:00
Seonglyong Gong
1f4f05e59c Python stack tracing OD flow (part 1) (#84362)
Summary: submodule update

Test Plan: CI

Differential Revision: D39176686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/84362
Approved by: https://github.com/robieta
2022-09-19 21:33:55 +00:00
atalman
25d91e0a9d Updating cudnn_frontend to 0.7.1 (#84943)
Updating cudnn_frontend to 0.7.1 to enable cuDNN 8.5 integration.

cc @malfet @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/84943
Approved by: https://github.com/huydhn, https://github.com/malfet
2022-09-13 23:00:09 +00:00
Driss Guessous
0fc02dbba4 flash_attention integration (#81434)
# Summary:
- I added a new submodule, CUTLASS, pinned to the 2.10 release. Inclusion of the flash_attention code is gated by the flag USE_FLASH_ATTENTION, which defaults to off so that flash is not built anywhere. This is deliberate, since we don't have A100 machines to compile and test on.

- Only CMake has been looked at; bazel and buck have not been attempted yet.

- I included the mha_fwd from flash_attention, refactored to use CUTLASS 2.10. There is currently no backward kernel on this branch; that would be a good follow-up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81434
Approved by: https://github.com/cpuhrsch
2022-09-09 20:11:26 +00:00
Stephen Jia
732255f031 [vulkan] Add VMA as a third_party subrepo (#83906)
The [VulkanMemoryAllocator](https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator) is a popular library for GPU memory allocation using Vulkan. The Vulkan backend depends on it, but since it is only a single header file we currently include it by checking it into the repo under [aten/src/ATen/native/vulkan/api/vk_mem_alloc.h](https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/vulkan/api/vk_mem_alloc.h). However, it is better to check it in as a third-party submodule, since that allows proper version tracking.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83906
Approved by: https://github.com/kimishpatel
2022-08-23 18:42:46 +00:00
Huy Do
f0ee21fe0a Update cpuinfo to the latest commit (#83620)
This hasn't been updated for a while, so pulling the latest commit from https://github.com/pytorch/cpuinfo. I wonder if it breaks anything.

Fixes #83594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/83620
Approved by: https://github.com/malfet
2022-08-20 06:16:54 +00:00
yanbing-j
6dc8673b1b Update ideep for NNC post-op (#82705)
### Description
This PR adds NNC post-op fusion support in ideep for further NNC development. It includes:

- element-wise post-op fusion
- conv/matmul/linear + binary post-op fusion

### Performance
**Common configuration:**
- Jemalloc and iomp enabled
- BS=1
- num_warmup = 300
- num_run = 500
- Average time of 1 iteration in ms is used
- time_before: no fusion
- time_after: with fusion
- Eltwise OPs selected: hardswish and abs
- Using oneDNN v2.6

**On ICX (32 cores per socket):
Conv2d FP32 (channels-last format)**

config | shape | time_(ms)_before | time_(ms)_after | Gain
-- | -- | -- | -- | --
1socket | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.112174 | 0.071106 | 36.61%
1socket | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.11269 | 0.070586 | 37.36%
1socket | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.164219 | 0.129498 | 21.14%
1socket | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.169371 | 0.1277 | 24.60%
  |   |   |   |  
  | shape | time_(ms)_before | time_(ms)_after | Gain
1thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.994555 | 1.429813 | 28.31%
1thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.715168 | 1.459937 | 14.88%
1thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 2.997382 | 2.47915 | 17.29%
1thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 3.044476 | 2.499366 | 17.90%
  |   |   |   |  
  | shape | time_(ms)_before | time_(ms)_after | Gain
4thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.405204 | 0.38117 | 5.93%
4thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.410145 | 0.389279 | 5.09%
4thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.67917 | 0.662792 | 2.41%
4thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.682302 | 0.671226 | 1.62%

**On CPX (28 cores per socket):
Conv2d BF16 (channels-last format)**

config | shape | time_(ms)_before | time_(ms)_after | Gain
-- | -- | -- | -- | --
1socket | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.119289 | 0.091015 | 23.70%
1socket | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.144116 | 0.09339 | 35.20%
1socket | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.209975 | 0.177111 | 15.65%
1socket | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.234777 | 0.179945 | 23.36%
  |   |   |   |  
  | shape | time_(ms)_before | time_(ms)_after | Gain
1thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.296252 | 1.086423 | 16.19%
1thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 1.364738 | 1.131289 | 17.11%
1thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 3.99519 | 3.736147 | 6.48%
1thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 4.03415 | 3.77981 | 6.30%
  |   |   |   |  
  | shape | time_(ms)_before | time_(ms)_after | Gain
4thread | Conv+abs_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.27474 | 0.245281 | 10.72%
4thread | Conv+hardswish_kernel=3_N=1_iC=64_H=56_W=56_oC=64_stride=1_pad=1_dilates=1_groups=1 | 0.28595 | 0.254748 | 10.91%
4thread | Conv+abs_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.847318 | 0.791453 | 6.59%
4thread | Conv+hardswish_kernel=3_N=1_iC=512_H=56_W=56_oC=512_stride=2_pad=1_dilates=1_groups=32 | 0.870212 | 0.801594 | 7.89%

**On CPX (28 cores per socket):
Linear BF16**

config | shape | time_(ms)_before | time_(ms)_after | Gain
-- | -- | -- | -- | --
1socket | Linear+abs_N=1_iC=1024_oC=4096 | 0.043199 | 0.037603 | 12.95%
1socket | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.041845 | 0.038332 | 8.40%
1socket | Linear+abs_N=1_iC=4096_oC=1024 | 0.048282 | 0.044281 | 8.29%
1socket | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.048362 | 0.044106 | 8.80%
1socket | Linear+abs_N=1_iC=2048_oC=1000 | 0.036302 | 0.0344 | 5.24%
1socket | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.035734 | 0.035593 | 0.39%
  |   |   |   |  
  | shape | time_(ms)_before | time_(ms)_after | Gain
1thread | Linear+abs_N=1_iC=1024_oC=4096 | 0.365143 | 0.36279 | 0.64%
1thread | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.364464 | 0.363392 | 0.29%
1thread | Linear+abs_N=1_iC=4096_oC=1024 | 0.384498 | 0.379902 | 1.20%
1thread | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.382545 | 0.381252 | 0.34%
1thread | Linear+abs_N=1_iC=2048_oC=1000 | 0.213244 | 0.209999 | 1.52%
1thread | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.212003 | 0.208567 | 1.62%
  |   |   |   |  
  | shape | time_(ms)_before | time_(ms)_after | Gain
4thread | Linear+abs_N=1_iC=1024_oC=4096 | 0.126096 | 0.12157 | 3.59%
4thread | Linear+hardswish_N=1_iC=1024_oC=4096 | 0.126627 | 0.121662 | 3.92%
4thread | Linear+abs_N=1_iC=4096_oC=1024 | 0.132845 | 0.128921 | 2.95%
4thread | Linear+hardswish_N=1_iC=4096_oC=1024 | 0.132642 | 0.12783 | 3.63%
4thread | Linear+abs_N=1_iC=2048_oC=1000 | 0.079582 | 0.072584 | 8.79%
4thread | Linear+hardswish_N=1_iC=2048_oC=1000 | 0.077761 | 0.071981 | 7.43%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82705
Approved by: https://github.com/frank-wei, https://github.com/eellison
2022-08-18 05:08:12 +00:00
Nikita Shulga
c08092fdf2 Update NCCL to v2.13.4-1 (#82775)
Also, update the slimming script to include the two instances of net.o that the new library generates.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82775
Approved by: https://github.com/ngimel
2022-08-04 19:36:45 +00:00
zengk95
d0e6e5a5bb Revert "sym_numel (#82374)" (#82726)
TSIA

It looks like PR #82374 is breaking Mac builds on trunk, but I can't revert it normally since there's a merge conflict in the XLA hash.
<img width="1753" alt="image" src="https://user-images.githubusercontent.com/34172846/182644661-b7fdda4b-e5ce-45c3-96a2-ad6737d169ae.png">

I reverted it and resolved the conflict using the old XLA hash that this commit was based upon.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82726
Approved by: https://github.com/albanD, https://github.com/janeyx99
2022-08-03 15:23:47 +00:00
Nikolay Korovaiko
fd68b0931f sym_numel (#82374)
### Description
This PR makes `numel` symint-aware, similar to `sym_sizes()` and `sym_strides()` (see also https://github.com/pytorch/pytorch/pull/81300). This PR is part of a bigger project to support dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82374
Approved by: https://github.com/ezyang
2022-08-03 06:33:45 +00:00
Jianyu Huang
916a565151 Upgrade fbgemm in OSS PyTorch (#82676)
Differential Revision: D38368525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/82676
Approved by: https://github.com/ngimel
2022-08-03 00:28:43 +00:00
albanD
4b7de26556 Fix C API to be compatible with latest 3.11 beta (#81242)
Based off https://github.com/pytorch/pytorch/pull/80511 with extra changes:
- Update pybind to the latest release as it contains some needed fixes
- Extend the compat header to reduce changes in code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81242
Approved by: https://github.com/malfet, https://github.com/mattip
2022-07-27 08:37:10 +00:00
Max Ren
0b3a239e85 [pocket fft] turning on pocketfft flag (#81670)
Summary:
enabling the AT_POCKETFFT_ENABLED flag and adding the appropriate dependencies to aten-cpu

moved mkl files from
`aten_cpu_source_non_codegen_list` to
`aten_native_source_non_codegen_list`

Test Plan:
After building testing binaries for both android and ios targets

### iOS
`fbcode/aibench/specifications/frameworks/pytorch/ios/build.sh`

Submitted benchmarks with the new binaries supporting pocketfft here:
https://www.internalfb.com/intern/aibench/details/245253003946591

### Android
`fbcode/aibench/specifications/frameworks/pytorch/android/arm64/build.sh`

Submitted benchmarks with the new binaries supporting pocketfft here:
https://www.internalfb.com/intern/aibench/details/406253690682941

### Build Size Impact

Success: igios-pika on D37790257-V7

☷[pocket fft] turning on pocketfft flag☷
Diff: https://fburl.com/diff/exkploof
Unigraph Explorer: https://fburl.com/mbex/aipdzaqo

Changes for variation [arm64 + 3x assets]:
```
Compressed  : -473 B (-0.00%) => 86.69 MiB
Uncompressed: +2.4 KiB (+0.00%) => 187.71 MiB
```

Reviewed By: kimishpatel

Differential Revision: D37790257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81670
Approved by: https://github.com/kit1980
2022-07-21 02:45:20 +00:00
PyTorch MergeBot
7408004454 Revert "[Codemod][Format buck files with arc lint] caffe2/third_party (#81441)"
This reverts commit 1233c3c256.

Reverted https://github.com/pytorch/pytorch/pull/81441 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-07-19 09:57:32 +00:00
James Donald
1233c3c256 [Codemod][Format buck files with arc lint] caffe2/third_party (#81441)
Reviewed By: jdonald

Differential Revision: D37710887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/81441
Approved by: https://github.com/malfet
2022-07-18 17:10:23 +00:00
mattip
37474a54de create a concated LICENSE file for wheels (#81500)
Fixes #81181 by creating a temporary LICENSE file that has all the third-party licenses concatenated together when creating a wheel. Also update the `third_party/LICENSES_BUNDLED.txt` file.

The `third_party/LICENSES_BUNDLED.txt` file is supposed to be tested via `tests/test_license.py`, but the test is not running?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81500
Approved by: https://github.com/rgommers, https://github.com/seemethere
2022-07-18 14:02:37 +00:00
Taylor Robie
9d3c35d1e1 Back out "Revert D37720837: Back out "Revert D37228314: [Profiler] Include ActivityType from Kineto"" (#81450)
Differential Revision: [D37842341](https://our.internmc.facebook.com/intern/diff/D37842341/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D37842341/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81450
Approved by: https://github.com/pbelevich
2022-07-15 18:25:40 +00:00
Jing Xu
3c7044728b Enable Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs (ITT) to PyTorch (#63289)
A more detailed description of the benefits can be found at #41001. This is Intel's counterpart of NVidia's NVTX (https://pytorch.org/docs/stable/autograd.html#torch.autograd.profiler.emit_nvtx).

ITT is a functionality for labeling trace data during application execution across different Intel tools.
For integrating Intel(R) VTune Profiler into Kineto, ITT needs to be integrated into PyTorch first. It works with both the standalone VTune Profiler (https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html) and, in the future, Kineto-integrated VTune functionality.
It works for both Intel CPU and Intel XPU devices.

Pitch
Add VTune Profiler's ITT API function calls to annotate PyTorch ops, as well as developer-customized code scopes, on CPU, like NVTX for NVidia GPUs.

This PR rebases the code changes at https://github.com/pytorch/pytorch/pull/61335 to the latest master branch.

Usage example:
```python
with torch.autograd.profiler.emit_itt():
    for i in range(10):
        torch.itt.range_push('step_{}'.format(i))
        model(input)
        torch.itt.range_pop()
```

cc @ilia-cher @robieta @chaekit @gdankel @bitfort @ngimel @orionr @nbcsm @guotuofeng @guyang3532 @gaoteng-git
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63289
Approved by: https://github.com/malfet
2022-07-13 13:50:15 +00:00
PyTorch MergeBot
36d2c44cce Revert "Back out "Revert D37228314: [Profiler] Include ActivityType from Kineto" (#81122)"
This reverts commit 52a538868b.

Reverted https://github.com/pytorch/pytorch/pull/81122 on behalf of https://github.com/clee2000 due to broke periodic buck build https://github.com/pytorch/pytorch/runs/7306516655?check_suite_focus=true
2022-07-12 18:20:00 +00:00
Taylor Robie
52a538868b Back out "Revert D37228314: [Profiler] Include ActivityType from Kineto" (#81122)
Reland

Differential Revision: [D37720837](https://our.internmc.facebook.com/intern/diff/D37720837/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D37720837/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81122
Approved by: https://github.com/chaekit
2022-07-12 14:54:01 +00:00
PyTorch MergeBot
a965a67492 Revert "[Profiler] Include ActivityType from Kineto (#80750)"
This reverts commit 2f6f7391ef.

Reverted https://github.com/pytorch/pytorch/pull/80750 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally
2022-07-08 05:16:56 +00:00
Taylor Robie
2f6f7391ef [Profiler] Include ActivityType from Kineto (#80750)
We don't want to compile with Kineto on all platforms, but if we're going to have significant integration between the profiler and Kineto, the profiler will need to be able to rely on simple API constructs like the Kineto enums.

Differential Revision: [D37228314](https://our.internmc.facebook.com/intern/diff/D37228314/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D37228314/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80750
Approved by: https://github.com/aaronenyeshi
2022-07-08 04:59:06 +00:00
PyTorch MergeBot
814cccc968 Revert "Automated submodule update: kineto (#79925)"
This reverts commit cc0f1cc3d3.

Reverted https://github.com/pytorch/pytorch/pull/79925 on behalf of https://github.com/malfet due to Seems to have caused CUDA-10.2 regression, see https://hud.pytorch.org/hud/pytorch/pytorch/master/1?name_filter=linux-bionic-cuda10.2
2022-07-06 22:14:13 +00:00
Facebook Community Bot
cc0f1cc3d3 Automated submodule update: kineto (#79925)
This is an automated pull request to update the first-party submodule for [pytorch/kineto](https://github.com/pytorch/kineto).

New submodule commit: a7c85d503c

Test Plan: Ensure that CI jobs succeed on GitHub before landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/79925
Approved by: https://github.com/malfet, https://github.com/robieta
2022-07-06 16:59:31 +00:00
PyTorch MergeBot
b1943e01e2 Revert "[MPS] Add test consistency from OpInfo based tests from PR 78504 (#79532)"
This reverts commit c71886e048.

Reverted https://github.com/pytorch/pytorch/pull/79532 on behalf of https://github.com/malfet due to Unintended submodules updates
2022-06-30 16:37:11 +00:00
PyTorch MergeBot
1454515253 Revert "Enable Intel® VTune™ Profiler's Instrumentation and Tracing Technology APIs (ITT) to PyTorch (#63289)"
This reverts commit f988aa2b3f.

Reverted https://github.com/pytorch/pytorch/pull/63289 on behalf of https://github.com/malfet due to broke trunk, see f988aa2b3f
2022-06-30 12:49:41 +00:00