Move `from_ivalue` and `to_ivalue` and their dependents `StableIValueBoxedKernel`, `aoti_torch_library_impl`, and `aoti_torch_call_dispatcher` into a new, non-AOTI `shim_common.cpp`.
This is in preparation for the PRs above, where I add v2s (`torch_call_dispatcher` and `torch_library_impl`) that are version-aware.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166373
Approved by: https://github.com/janeyx99
ghstack dependencies: #164356
As per title.
Generalizing to arbitrary contiguous inputs seems safe, since `at::matmul` is likely to flatten them to avoid `baddbmm`.
Additionally, we guard that the bias is 1D and contiguous, in which case it is guaranteed to be fused with no copies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166071
Approved by: https://github.com/ngimel
Summary:
When dealing with a large memory trace, the resulting plot can be challenging to interpret and analyze.
This commit introduces a feature that enables filtering of allocations that have already been freed, providing a more focused view.
The remaining events in the plot often warrant closer examination, as they may be indicative of potential out-of-memory (OOM) issues.
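As a rough illustration of the filtering idea, the sketch below keeps only the allocations that were never freed. The `(action, address, size)` event tuples and the `live_allocations` helper are hypothetical and do not reflect the actual memory-trace format or the code in this commit.

```python
from typing import Dict, List, Tuple

# Hypothetical event format for illustration: (action, address, size_bytes).
Event = Tuple[str, int, int]


def live_allocations(events: List[Event]) -> List[Event]:
    """Return the "alloc" events whose address was not freed afterwards."""
    live: Dict[int, Event] = {}
    for event in events:
        action, addr, _size = event
        if action == "alloc":
            live[addr] = event
        elif action == "free":
            # Drop the matching allocation; ignore frees with no known alloc.
            live.pop(addr, None)
    return list(live.values())


trace = [
    ("alloc", 0x1000, 512),
    ("alloc", 0x2000, 1024),
    ("free", 0x1000, 512),
    ("alloc", 0x1000, 256),  # address reused after the free
    ("alloc", 0x3000, 2048),
    ("free", 0x3000, 2048),
]
remaining = live_allocations(trace)
print(remaining)
```

Only the allocations still live at the end of the trace survive the filter, which is the subset worth inspecting when chasing an OOM.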
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165752
Approved by: https://github.com/zdevito
The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class.
Changes:
1. Add function `treespec_leaf()` to replace `LeafSpec()`.
2. Add functions `treespec_tuple(...)` and `treespec_dict(...)` to create treespecs for `tuple` / `dict`, which are used for `*args` / `**kwargs`. This avoids direct modification of `treespec` instances, which relies on the implementation details of the `PyTreeSpec` class.
3. Change `len(spec.children_specs)` to `spec.num_children`.
4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`.
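To make the intent of the list above concrete, here is a toy model of the pattern: callers build specs through small factory functions and query them through `num_children` / `is_leaf()` rather than touching internals. `ToyTreeSpec` and the `toy_*` helpers are invented for this sketch; they are not the real `PyTreeSpec` class, and the real helpers' signatures may differ.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class ToyTreeSpec:
    """Toy stand-in for a treespec; NOT the real PyTreeSpec class."""

    kind: Optional[type]  # None marks a leaf
    context: Any = None   # e.g. the dict keys
    children: tuple = ()

    @property
    def num_children(self) -> int:
        return len(self.children)

    def is_leaf(self) -> bool:
        return self.kind is None


def toy_treespec_leaf() -> ToyTreeSpec:
    return ToyTreeSpec(None)


def toy_treespec_tuple(children=()) -> ToyTreeSpec:
    return ToyTreeSpec(tuple, None, tuple(children))


def toy_treespec_dict(mapping=None) -> ToyTreeSpec:
    mapping = mapping or {}
    return ToyTreeSpec(dict, tuple(mapping.keys()), tuple(mapping.values()))


leaf = toy_treespec_leaf()
args_spec = toy_treespec_tuple([leaf, leaf])       # spec for *args
kwargs_spec = toy_treespec_dict({"scale": leaf})   # spec for **kwargs
in_spec = toy_treespec_tuple([args_spec, kwargs_spec])
print(in_spec.num_children, in_spec.is_leaf(), leaf.is_leaf())
```

Because callers only use the factories and the two accessors, the internal layout of the spec class can change without breaking them.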
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843
Approved by: https://github.com/mlazos
This PR enables additional Inductor unit tests for Intel GPU. Due to the increased number of test cases, the number of runners has been extended from 8 to 12 to prevent CI timeouts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166047
Approved by: https://github.com/jansel
Co-authored-by: Deng, Daisy <daisy.deng@intel.com>
Co-authored-by: Jason Ansel <jansel@jansel.net>
Fixes #100842
Disable jiterator for the complex tan and tanh kernels due to accuracy issues, matching the existing approach used for acos, acosh, asin, and asinh. This reverts to the thrust implementation, which provides better numerical accuracy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165250
Approved by: https://github.com/ezyang
Fixes #161366
All four combinations of operand dimensions are supported: 2d-2d, 2d-3d, 3d-3d, and 3d-2d.
The corresponding test cases in test_matmul_cuda pass for both the forward and backward passes.
The CK path is enabled for gfx942 and gfx950.
TODO: Enable support on gfx90a. The CK kernel used in this commit produces a GPU error there, so gfx90a
might require a different CK kernel config, based on the profiler results on gfx90a.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166334
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
# Context
Previously, we would modify the parent process's NUMA bindings in order to force child processes to inherit them.
However, this would not work correctly if `start_method="forkserver"`, because the subprocesses would actually inherit their bindings from the forkserver middleman process. In this case, the inherited affinity would actually be incorrect for all but the first subprocess (because the forkserver process would get created lazily, and hence inherit and then stick with the bindings intended for the first subprocess).
# This PR
* `str` entrypoints: Use `numactl` CLI
* `Callable` entrypoints: Wrap the `Callable` entrypoint and call `os.sched_setaffinity` inside it.
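A minimal sketch of the two strategies above, not the code in this PR: the helper names are hypothetical, and the choice of `numactl` flags (`--cpunodebind` / `--membind`) is an assumption about one reasonable way to bind a command to a node.

```python
import os
from functools import partial
from typing import Callable, Iterable, List


def numactl_command(entrypoint: str, args: List[str], node: int) -> List[str]:
    # Assumed flag choice: bind both CPU and memory to a single NUMA node.
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}", entrypoint, *args]


def _call_with_affinity(cpus: frozenset, fn: Callable, *args, **kwargs):
    # Runs inside the child process, so the parent's affinity is never touched.
    if hasattr(os, "sched_setaffinity"):  # Linux-only API
        os.sched_setaffinity(0, cpus)     # pid 0 means "the calling process"
    return fn(*args, **kwargs)


def wrap_with_affinity(fn: Callable, cpus: Iterable[int]) -> Callable:
    # A partial over a module-level function stays picklable for
    # spawn/forkserver, as long as `fn` itself is picklable.
    return partial(_call_with_affinity, frozenset(cpus), fn)


# Reuse the current affinity so the demo does not change anything.
current = os.sched_getaffinity(0) if hasattr(os, "sched_getaffinity") else {0}
wrapped = wrap_with_affinity(pow, current)
print(wrapped(2, 5))
print(numactl_command("python", ["train.py"], node=0))
```

Since the affinity call happens inside the wrapped entrypoint, each subprocess sets its own bindings regardless of which process it was forked or spawned from.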
Hopefully this will be the last necessary iteration.
# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`
## Manual
Verified flops/sec and memory locality wins on several different types of jobs:
* `Callable` with forkserver
* `str` entrypoint with spawn
* `Callable` entrypoint with spawn
More details in [this doc (Meta-only).](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.scjv58yswi64)
# Later PR
Update all the documentation when we're confident this has stabilized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166026
Approved by: https://github.com/d4l3k
Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
# Motivation
In https://github.com/pytorch/pytorch/pull/145591, `std::hardware_destructive_interference_size` was introduced in CUDACachingAllocator. Later, https://github.com/pytorch/pytorch/pull/160067 moved it to `c10/core/alignment.h` for code reuse.
However, on **GCC 13+** using `std::hardware_destructive_interference_size` triggers the following warning:
```bash
warning: use of ‘std::hardware_destructive_interference_size’ [-Winterference-size]
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: its value can vary between compiler versions or with different ‘-mtune’ or ‘-mcpu’ flags
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: if this use is part of a public ABI, change it to instead use a constant variable you define
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: the default value for the current CPU tuning is 64 bytes
/home/pt-gpu/4T-4652/guangyey/stock-pytorch/aten/src/ATen/core/CachingHostAllocator.h:42:16: note: you can stabilize this value with ‘--param hardware_destructive_interference_size=64’, or disable this warning with ‘-Wno-interference-size’
```
# Solution
- Solution 1: Replace `c10::hardware_destructive_interference_size` with a constant 64.
```cpp
constexpr std::size_t hardware_destructive_interference_size = 64;
```
- Solution 2: Add `-Wno-interference-size` to 8d4e48831e/cmake/public/utils.cmake (L386) to suppress the warning.
# Additional Context
The current implementation uses the second approach. If the reviewers prefer the first approach, I am happy to update it accordingly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166297
Approved by: https://github.com/ezyang
It is recommended to use `python -m pip install --no-build-isolation .` instead of `pip3 install --no-build-isolation .`, because most of us use a virtual environment and the latter may resolve to the system `pip3` rather than the one from the conda or uv environment. The install should use the same Python we run, which is also consistent with how `torch` is installed.
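The difference can be checked from Python itself. The sketch below (the `pip_install_command` helper is hypothetical) builds the install command from `sys.executable`, so pip is guaranteed to belong to the running interpreter, and prints whichever `pip3` happens to be first on `PATH` for comparison.

```python
import shutil
import sys


def pip_install_command(target: str = ".") -> list:
    # sys.executable is the interpreter running this script, so
    # "-m pip" installs into that interpreter's environment.
    return [sys.executable, "-m", "pip", "install", "--no-build-isolation", target]


cmd = pip_install_command()
print("interpreter :", sys.executable)
print("pip3 on PATH:", shutil.which("pip3"))  # may be None or belong to another environment
print("command     :", cmd)
```

If the `pip3` path printed here is outside the active environment, a bare `pip3 install` would install into the wrong Python.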
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166235
Approved by: https://github.com/fffrog, https://github.com/ezyang
When the slice is a tensor, we decompose it into a `.item()` call and pass the unbacked symbol to the slice to avoid a DDE.
This diff also fixes an existing bug in `codegen_dynamic_slice_size` in the cpp wrapper: a `+1` should be `-1`, which makes it match
the Python codegen.
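For intuition on why the sign matters, the sketch below assumes the slice size is computed as the usual ceiling division `(end - start + step - 1) // step`; the exact expression emitted by the cpp wrapper is not shown in this message, and the "buggy" variant is only an illustration of what a `+ 1` in that position would do.

```python
def slice_size(start: int, end: int, step: int = 1) -> int:
    """Number of elements in [start, end) taken with a positive step."""
    return max(0, (end - start + step - 1) // step)


def slice_size_plus_one(start: int, end: int, step: int = 1) -> int:
    # Illustrative only: the same expression with `+ 1` instead of `- 1`.
    return max(0, (end - start + step + 1) // step)


# The `- 1` form agrees with Python's own range length for every case tried.
for start in range(0, 6):
    for end in range(0, 12):
        for step in range(1, 5):
            assert slice_size(start, end, step) == len(range(start, end, step))

print(slice_size(0, 9, 3), slice_size_plus_one(0, 9, 3))
```

With `start=0, end=9, step=3`, the `- 1` form gives 3 elements while the `+ 1` form overcounts to 4.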
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165074
Approved by: https://github.com/Lucaskabela
- Remove all complex defines logic from the header
- Make the GreenContext constructor private, as it should only be created via the static method as a singleton
- Delete unused `getContext` and `getGreenContext` methods
- Rename `CUDA_HAS_GREEN_CONTEXT` to `HAS_CUDA_GREEN_CONTEXT()`, which results in compilation error if one accidentally makes a typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166462
Approved by: https://github.com/ngimel, https://github.com/eqy
Summary:
Previously, we didn't correctly handle the closure tuple when it had content in it. This adds code to serialize the tuple and merge it with the guard manager's local scope.
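For readers unfamiliar with closure tuples, the sketch below shows how a function's `__closure__` cells map onto names and how they could be merged into a scope. It uses a plain dict as the scope and a hypothetical `closure_scope` helper; it is not the Dynamo serialization code, and how the guard manager represents its local scope is an assumption here.

```python
from typing import Any, Callable, Dict


def closure_scope(fn: Callable) -> Dict[str, Any]:
    """Map each free variable name of `fn` to the value in its closure cell."""
    cells = fn.__closure__ or ()  # None when the function captures nothing
    return dict(zip(fn.__code__.co_freevars, (c.cell_contents for c in cells)))


def make_scaler(factor: float, offset: float) -> Callable[[float], float]:
    def scale(x: float) -> float:
        return x * factor + offset

    return scale


fn = make_scaler(2.0, 1.0)
local_scope = {"x": 3.0}
# Merge the captured values into the scope that name lookups would see.
merged = {**local_scope, **closure_scope(fn)}
print(merged)
```

A function with an empty closure yields an empty dict, so the merge is a no-op in that case; the non-empty case is the one this fix addresses.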
Test Plan:
pytest test/dynamo/test_aot_compile.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166351
Approved by: https://github.com/Lucaskabela
This pull request introduces a standardized YAML-based configuration system for transformer attention benchmarks, making it easier to run and manage comprehensive performance tests. It adds example configs and a wrapper script that converts YAML configs into CLI arguments for the benchmark runner.
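The config-to-CLI conversion might look roughly like the sketch below. The `config_to_argv` helper, the key names, and the flag conventions are all hypothetical; YAML loading is omitted, so the config is written as the plain dict a YAML loader would produce.

```python
from typing import Any, Dict, List


def config_to_argv(config: Dict[str, Any]) -> List[str]:
    """Turn a flat config mapping into a list of CLI arguments."""
    argv: List[str] = []
    for key, value in config.items():
        flag = "--" + key.replace("_", "-")
        if isinstance(value, bool):
            if value:
                argv.append(flag)  # booleans become bare switches
        elif isinstance(value, (list, tuple)):
            argv.append(flag)
            argv.extend(str(v) for v in value)  # one flag, many values
        elif value is not None:
            argv.extend([flag, str(value)])
    return argv


# Hypothetical keys, standing in for a parsed YAML file.
config = {
    "dtype": "bfloat16",
    "batch_sizes": [1, 8],
    "backward": True,
    "dynamic": False,
    "output_file": None,
}
argv = config_to_argv(config)
print(argv)
```

Keeping the conversion in a thin wrapper lets the benchmark runner keep its existing argument parser while configs live in version-controlled files.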
#### Next Steps:
CI Enablement: This change is a step toward running the attention ops in CI for regression tracking.
#### Developer flow: (Run locally)
`python score_mod.py --config configs/config_test.yaml`
#### Enabling CI run: https://github.com/pytorch/pytorch/pull/165915
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164155
Approved by: https://github.com/jbschlosser