In deep learning models, the tanh (hyperbolic tangent) function is a widely used activation function, primarily in feedforward networks, recurrent neural networks (RNNs), and various other architectures.
Tanh is also commonly used in **Physics-Informed Neural Networks (PINNs)**, a class of machine learning models designed to solve partial differential equations (PDEs) by incorporating the governing physics directly into the loss function, along with data-driven terms.
In PINNs, activation functions like tanh are used in the neural network architecture to enable the model to learn complex mappings between inputs (such as spatial and temporal coordinates) and outputs (such as field variables).
**Operator: tanh()**
**Current Implementation in OSS in ATen Backend:**
**SVE Flow:** Uses the SVE Sleef implementation when available; otherwise falls back to the std implementation.
**With this PR:**
**SVE Flow:** Uses the SVE ACLE implementation (faster).
**Here are the performance improvements.**
**Single core perf numbers:**

**Metric:** CPU time avg time per iteration (In ms)
With both the gcc and clang compilers, the SVE ACLE implementation shows a significant performance gain over the current OSS implementation (Sleef) and also over Neon.
**Hardware:** m7g.8xlarge (Graviton 3 Instance)
**Script used in benchmarking:**
```python
import os

# Toggle between the default and SVE256 code paths
# os.environ["ATEN_CPU_CAPABILITY"] = "default"
os.environ["ATEN_CPU_CAPABILITY"] = "sve256"

import torch

# Set the random seed for reproducibility
torch.manual_seed(1)

# Create a tensor of shape (8521, 50)
x = torch.randn(8521, 50)

# Warm up before profiling
for i in range(10):
    output = x.tanh()

# Perform the tanh operation 1000 times and profile the performance
print("### CPU tanh")
with torch.autograd.profiler.profile(record_shapes=True) as prof:
    for i in range(1000):
        output = x.tanh()

# Print the profiling results sorted by self CPU time
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

# Optionally print the final output (uncomment the following line if needed)
# print(output)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143741
Approved by: https://github.com/malfet
Changes in this PR:
1. Add `is_structseq` and `is_structseq_class` functions to determine whether an object or a class is a PyStructSequence.
2. Add a generic class `structseq` which can be used as the registration key for PyStructSequence types, like `namedtuple` for Named Tuple types.
3. Change `is_namedtuple` to treat subclasses of a namedtuple class as namedtuples. Before this PR, only classes created directly by `collections.namedtuple` or `typing.NamedTuple` counted as namedtuple classes, while their subclasses did not. This PR makes `is_namedtuple` return true for subclasses of a namedtuple class.
Resolves #75982. New tests are included in this PR.
- #75982
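A rough sketch of the kind of duck-typed checks described above, using only the standard library. The function bodies here are assumptions for illustration and are simplified; they are not the code added in this PR.

```python
import time
from collections import namedtuple


def is_namedtuple_class(cls):
    """True for namedtuple classes and for subclasses of them (duck-typed)."""
    return (
        isinstance(cls, type)
        and issubclass(cls, tuple)
        and isinstance(getattr(cls, "_fields", None), tuple)
        and all(isinstance(name, str) for name in cls._fields)
        and callable(getattr(cls, "_make", None))
        and callable(getattr(cls, "_asdict", None))
    )


def is_structseq_class(cls):
    """True for PyStructSequence types such as time.struct_time (simplified)."""
    return (
        isinstance(cls, type)
        and cls.__bases__ == (tuple,)
        and all(
            isinstance(getattr(cls, attr, None), int)
            for attr in ("n_fields", "n_sequence_fields", "n_unnamed_fields")
        )
    )


Point = namedtuple("Point", ["x", "y"])


class LabeledPoint(Point):
    """A subclass of a namedtuple class."""


print(is_namedtuple_class(Point))
print(is_namedtuple_class(LabeledPoint))
print(is_structseq_class(time.struct_time))
print(is_structseq_class(Point))
```

Checking `issubclass(cls, tuple)` plus the namedtuple attributes, rather than the direct base class, is what lets `LabeledPoint` pass.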
Pull Request resolved: https://github.com/pytorch/pytorch/pull/113257
Approved by: https://github.com/zou3519
Summary:
dynamo_compile has, for the most part, been accounting for compile time, with the exception of autotuning.
all_compilation_types had earlier been injected on fx_codegen_and_compile, which was incorrect.
Add autotuning to dynamo and deprecate the all_compilation_types counter.
Differential Revision: D72145447
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150293
Approved by: https://github.com/masnesral, https://github.com/jamesjwu
Per title, this version uses the symm mem input both as the input source and as a work buffer, so the input is modified by the time the op finishes (similar to what fbgemm car reduction does). It is intended to be wrapped in an op that would first copy the real inputs to symm mem buffers that wouldn't be exposed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150153
Approved by: https://github.com/xw285cornell
We found that `pip install cmake` and `conda install cmake` behave differently.
The reason is that the pip-installed cmake doesn't find the corresponding libs under the conda env, so we need to set `CMAKE_PREFIX_PATH` to align the two.
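As a minimal sketch of that alignment, assuming the active conda env is exposed through the `CONDA_PREFIX` environment variable: the helper name `with_conda_prefix` is hypothetical and is not the code in this PR.

```python
import os


def with_conda_prefix(env):
    """Return a copy of env whose CMAKE_PREFIX_PATH includes CONDA_PREFIX."""
    new_env = dict(env)
    conda_prefix = new_env.get("CONDA_PREFIX")
    if not conda_prefix:
        # Not inside a conda env: leave the environment untouched.
        return new_env
    existing = [
        p for p in new_env.get("CMAKE_PREFIX_PATH", "").split(os.pathsep) if p
    ]
    if conda_prefix not in existing:
        # Put the conda env first so its libs are found by cmake.
        existing.insert(0, conda_prefix)
    new_env["CMAKE_PREFIX_PATH"] = os.pathsep.join(existing)
    return new_env


print(with_conda_prefix({"CONDA_PREFIX": "/opt/conda/envs/dev"}))
```

The resulting mapping could be passed as `env=` to whatever subprocess call runs cmake.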
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148863
Approved by: https://github.com/CuiYifeng, https://github.com/malfet
Co-authored-by: Cui, Yifeng <yifeng.cui@intel.com>
Fix https://github.com/pytorch/pytorch/issues/148639.
Summary:
Optimize the heuristics of parallel reduction: When the number of steps of the first inner loop beyond the maximum parallel depth is much larger than the number of steps of all outer loops within the maximum parallel depth, change the starting depth of parallelism to the first inner loop and recalculate the maximum parallel depth. I ran the Inductor benchmark with this PR on CPU. A timm model poolformer_m36 BF16 has about 25% performance improvement, and no performance regression is seen.
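The heuristic can be sketched roughly as follows. This is an illustration only: the function name, the `ratio` threshold standing in for "much larger", and the recalculated depth of 1 are all assumptions, not the values used in Inductor.

```python
from math import prod


def choose_parallel_range(loop_steps, max_parallel_depth, ratio=8):
    """Return (start_depth, parallel_depth) for a loop nest.

    loop_steps[i] is the number of steps of the loop at depth i,
    outermost first. `ratio` is a hypothetical "much larger" threshold.
    """
    outer_steps = prod(loop_steps[:max_parallel_depth])
    has_inner = max_parallel_depth < len(loop_steps)
    if has_inner and loop_steps[max_parallel_depth] > ratio * outer_steps:
        # The first inner loop dominates: start parallelism there.
        # Simplified recalculation: parallelize just that one loop.
        return max_parallel_depth, 1
    # Otherwise keep parallelizing the outer loops as before.
    return 0, max_parallel_depth


print(choose_parallel_range([2, 4096], max_parallel_depth=1))
print(choose_parallel_range([64, 128], max_parallel_depth=1))
```

In the first call the outer loop has only 2 steps while the inner loop has 4096, so the starting depth moves inward; in the second call the outer loop already offers enough steps and nothing changes.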
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149614
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
On Intel Max 1550, models like Super_SloMo can actually pass the accuracy test once deterministic mode is enabled, because in some cases we do not use atomics in the upsampling bilinear backward when running on XPU. Furthermore, I guess the only reason for not enabling deterministic mode on these models is to avoid errors. We should use `warn_only=True` instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149028
Approved by: https://github.com/guangyey, https://github.com/desertfire
Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Fixes #143071
Operations performed on tensors with `requires_grad=True` such as
```python
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x,3)
```
are valid operations.
While an operation using `numpy` like
```python
import numpy as np
x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x,3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.
However, an operation that uses `math` like
```python
import math
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!
This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.
To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```
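The mechanism can be illustrated without torch: `math.pow` converts its arguments through `__float__`, so that hook is where a warning can be raised. `GradScalar` below is a hypothetical stand-in for a 0-dim tensor and is not the implementation in this PR.

```python
import math
import warnings


class GradScalar:
    """Hypothetical stand-in for a 0-dim tensor; not part of torch."""

    def __init__(self, value, requires_grad=False):
        self.value = value
        self.requires_grad = requires_grad

    def __float__(self):
        # math.pow() converts its arguments to C doubles, which lands here.
        if self.requires_grad:
            warnings.warn(
                "Converting a value with requires_grad=True to a scalar "
                "drops gradient tracking.",
                UserWarning,
                stacklevel=2,
            )
        return float(self.value)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    y = math.pow(GradScalar(2.0, requires_grad=True), 3)

print(type(y).__name__, y, len(caught))
```

The result `y` is a plain Python `float`, which is why no gradient information survives the call.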
Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/malfet
Co-authored-by: albanD <desmaison.alban@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Previously, scaled_mm's (FP8 matmul) Triton lowering for inductor was in a separate template. This PR consolidates that lowering into the mm template, with an added epilogue to deal with multiplying the scales. This paves the way for future scaled variants of BMM and Grouped GEMM in inductor.
Currently, there is still a separate template for the TMA+persistent version of scaled_mm, just as the current mm lowering has a separate template for its TMA+persistent version. The extra scaled_mm TMA+persistent template will hopefully be consolidated once the consolidation for the mm template is done.
TODO: Consolidate TMA+Persistent logic into 1 template and remove separate scaled_mm TMA template
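Conceptually, the consolidation looks like the plain-Python sketch below: one matmul routine, with the scaled variant differing only in its epilogue. The function names and the per-tensor scalar scales are assumptions for illustration; the real lowering is a Triton template.

```python
def matmul(a, b, epilogue=lambda acc: acc):
    """Plain-Python matmul that applies `epilogue` to each accumulated value."""
    inner, cols = len(b), len(b[0])
    return [
        [epilogue(sum(row[k] * b[k][j] for k in range(inner))) for j in range(cols)]
        for row in a
    ]


def scaled_matmul(a, b, scale_a, scale_b):
    """Reuse matmul; only the epilogue differs (per-tensor scales assumed)."""
    return matmul(a, b, epilogue=lambda acc: acc * scale_a * scale_b)


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul(a, b))
print(scaled_matmul(a, b, scale_a=0.5, scale_b=4.0))
```

Keeping the scale multiplication in the epilogue means the main loop is shared, which is what would let other scaled variants reuse it.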
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150045
Approved by: https://github.com/drisspg
Apparently the magical incantation to name these correctly lies in the
build_variant variable; otherwise it silently does nothing.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150310
Approved by: https://github.com/atalman
As is the case with many inductor tests, this test adapts test criteria based on device type, where it should be adjusting for the backend registered for that device.
In this particular case, using the upstream triton CPU backend would lead to failures, as reference_in_float would be true as this is required for the C++/OpenMP backend which does not have float16 support. However most triton backends do, and as such should be tested in float16. Similarly a triton backend with a device not described as a GPU would get skipped from testing entirely.
A more generic solution would be ideal, but this would require a lot of work across many tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146911
Approved by: https://github.com/masnesral
`tuple[int]` means only a tuple of length 1, which is not what was intended.
```python
loss = torch.masked.mean(loss, mask=mask, dim=(-1, -2)) # Argument of type "tuple[Literal[-1], Literal[-2]]" cannot be assigned to parameter "dim" of type "DimOrDims"
```
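The difference between the two annotations can be checked with the standard `typing` helpers. The `mean_over` function is a hypothetical example, not a torch API.

```python
from typing import Union, get_args

# tuple[int] has exactly one element type: a tuple of length 1.
print(get_args(tuple[int]))

# tuple[int, ...] uses Ellipsis to mean "any length, all ints".
print(get_args(tuple[int, ...]))


def mean_over(dim: Union[int, tuple[int, ...]]) -> tuple[int, ...]:
    """Hypothetical helper: normalize `dim` to a tuple of ints."""
    return (dim,) if isinstance(dim, int) else tuple(dim)


print(mean_over((-1, -2)))
```

With the variadic form, a type checker accepts `(-1, -2)` for `dim`, whereas `tuple[int]` would only accept a one-element tuple.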
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149870
Approved by: https://github.com/Skylion007
This PR fixes a bug with how include directories with spaces are handled on Windows. I ran into an edge case with torch.compile() - it will error out with an exception on Windows. In particular, it will try to execute the following: `cl /I C:/Program Files/Python311/Include ...`, where `C:/Program` will be treated as separate from `Files/Python311/Include`.
I looked into using something like `shlex.quote` or `pathlib.Path`, but I didn't find those options to be suitable (shlex is POSIX shell only, pathlib.Path does not escape spaces).
There is another place in the function that also deals with escaping spaces. My fix follows the same style. 0ff2e6a85a/torch/_inductor/cpp_builder.py (L1464)
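A minimal sketch of the quoting approach, assuming include directories are wrapped in double quotes only when they contain a space. The helper name `msvc_include_flags` is hypothetical and this is not the exact code in `cpp_builder.py`.

```python
def msvc_include_flags(include_dirs):
    """Build `/I` flags for cl, quoting directories that contain spaces."""
    flags = []
    for inc_dir in include_dirs:
        if " " in inc_dir:
            # Without quotes, cl would split the path at the space.
            inc_dir = f'"{inc_dir}"'
        flags.append(f"/I {inc_dir}")
    return " ".join(flags)


print(msvc_include_flags(["C:/Program Files/Python311/Include", "C:/src/include"]))
```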
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148271
Approved by: https://github.com/Skylion007
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>