Commit Graph

15267 Commits

Author SHA1 Message Date
Ke Wen
effc545274 [DDP] Use NCCL allocated memory for gradient bucket (#146589)
So that NVLink SHARP comes with zero-copy on H100+ platforms, for DDP applications.
Less SM usage, less memory contention between NCCL kernel and compute kernels.

Added env `DDP_DISABLE_COMM_MEM` as a back-out option:
```
An environment variable to disable comm-optimized memory pool.
Default is 0, which means comm-optimized memory pool is enabled.
Users can set it to 1 in case of seeing regression or OOM (because this
comm MemPool may not share space with regular compute MemPool).
```

Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589
Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj
2025-02-10 05:23:11 +00:00
Simon Fan
387c993c3b [ca] remove private API: _compiled_autograd_should_lift (#146720)
Since the functional autograd + compiled autograd migration, we don't trace into nodes anymore, and everything is lifted. We can't support this flag which tries to inline make_fx style in CA initial pass. There's no more usage internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146720
Approved by: https://github.com/zou3519
2025-02-10 04:29:57 +00:00
Dingming Wu
fa34128435 revert PTD's change that leads to signature mismatch of printNcclCommProxyTrace (#146453)
Summary: D68801098 introduced this function signature mismatch issue for printNcclCommProxyTrace. Revert it so that trunk build can pass.

Test Plan:
With the change, build of APS model using rcclexp can now pass:
`sh scripts/ltian/run_jobs/fb_fm_v2/run_fb_fm_v2_job.sh -h T20_GTT_MI300X -n 16 -b 1024 -t [2024-12-06] -d ai_infra_ngs -e ai_infra_training_rnd_tc -x 0`

Reviewed By: c-p-i-o

Differential Revision: D69149588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146453
Approved by: https://github.com/c-p-i-o
2025-02-07 22:43:52 +00:00
Shunting Zhang
bc0191802f [inductor] add size-asserts for fallback ops (#145904)
Fix https://github.com/pytorch/pytorch/issues/144717

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145904
Approved by: https://github.com/jansel
2025-02-07 18:44:32 +00:00
Tristan Rice
68631f6e87 PyWork: preserve Python reference counting when used in functional collectives (#146376)
@fegin  found an issue where torchft is not compatible with functional collectives.

Found in https://github.com/pytorch/torchtitan/pull/806

The root cause is because PyProcessGroup/PyWork are not compatible with functional collectives due to a nasty ownership bug.

PyWork relies on a pybind trampoline to propagate requests to Python unfortunately the way Pybind works is that the Python object owns the C++ object rather than some form of shared ownership. Thus what happens is that the PyWork Python object will collected when returned to C++ from the PyProcessGroup but the C++ PyWork object still exists. When the PyWork object is used, this causes a deadlock as the corresponding Python object no longer exists

To solve this, we introduce a new `PyWorkHolder` class which holds a reference to the `py::object` as well as the trampoline class. This resolves any dependency issues since we can now hold ownership in C++ to both the Python and C++ objects.

To make this cleaner we introduce a `WORK_OVERRIDE` macro which is a patched version of `PYBIND11_OVERRIDE` that returns a `PyWorkHolder` rather than just `PyWork` and use for all collectives in PyProcessGroup.

Test plan:

```
cd pytorch
pytest test/distributed/test_c10d_functional_native.py
```

```
cd torchft
pytest torchft/process_group_test.py -k functional -v -x -s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146376
Approved by: https://github.com/yifuwang
2025-02-07 18:07:53 +00:00
cyy
25aa7ca62d Cleanup CallOnce.h (#146700)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146700
Approved by: https://github.com/albanD
2025-02-07 16:44:45 +00:00
cyy
fa0592b568 Remove some NOLINT (#146610)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146610
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-02-07 01:50:06 +00:00
Michal Gallus
9ea1823f96 [ROCm][Windows] Remove external linkage from an anonymous namespace (#146607)
Fixes a clang-cl compiler error related to attempt to export a symbol that doesn't have any external linkage, since its declared within a local anonymous namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146607
Approved by: https://github.com/jeffdaily
2025-02-06 23:48:20 +00:00
Michael Suo
99dd846672 [torch] fix builds for older pybind (#146630)
Summary:
some versions of pybind we build with don't have `py::set_error`.

So just use the underlying python C API.

Test Plan: unit tests

Differential Revision: D69254629

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146630
Approved by: https://github.com/colin2328, https://github.com/ngimel
2025-02-06 21:22:00 +00:00
Eddie Yan
9ee506bd93 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee, https://github.com/malfet
2025-02-06 19:04:50 +00:00
Michael Suo
425804db2b [torch] fix exception types in custom class magic setattr/getattr (#146516)
Summary:
`c10::AttributeError` is not automatically converted to Python AttributeError, it needs some special macros (e.g. `HANDLE_TH_ERRORS`).

Some Python functions like `hasattr` rely on the type of the throw exception to be correct.

We don't need the fully generality of those macros, so just do a targeted error type conversion here.

Test Plan: added unit test

Differential Revision: D69197217

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146516
Approved by: https://github.com/zdevito
2025-02-06 02:14:11 +00:00
Simon Fan
72405b0c0f [ca] refactor compile reasons and log to tlparse (#146386)
This PR accumulates comple reasons inside each CacheNode, and logs them to tlparse on each CA compile. This defines a compile as an autograd structure change, and a recompile as a dynamic shape change.

sample tlparse: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpdbo7gt/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=100

for compiles:
```python
[
  "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]"
]
```

for recompiles:
```python
[
  "!0: Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]",
  "!1: Cache miss due to 7 changed tensor shapes (total of 7): sizes[0], sizes[1], sizes[2], sizes[3], sizes[4], sizes[5], sizes[6]"
]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146386
Approved by: https://github.com/jansel
ghstack dependencies: #146229
2025-02-05 23:33:21 +00:00
Zhengxu Chen
cd6c0707a8 [aoti] Assign proxy call args by name, and support default values. (#146263)
Fixing the following issue when compiling the following program:
```
                window = torch.hann_window(N_FFT).to(x.device)
                stft = torch.stft(
                    x, N_FFT, HOP_LENGTH, window=window, return_complex=True
                )
                magnitudes = stft[..., :-1].abs() ** 2
                return magnitudes
```
```
Traceback (most recent call last):
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper
    method(*args, **kwargs)
  File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft
    self.check_model(model, example_inputs)
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model
    actual = AOTIRunnerUtil.run(
             ^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run
    optimized = AOTIRunnerUtil.load(device, so_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load
    return torch._export.aot_load(so_path, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load
    runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device)  # type: ignore[assignment, call-arg]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type for argument 1 but got as_device
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263
Approved by: https://github.com/angelayi
2025-02-05 15:43:05 +00:00
Simon Fan
e20b0c82d1 [ca] no longer require is_traceable annotations for c++ autograd functions (#146229)
This PR removes the CA compile-time error for C++ autograd functions, and supports them by having dynamo graph break on them (instead of allow_in_graph). The CppNode's collects are kept as is for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146229
Approved by: https://github.com/jansel, https://github.com/zou3519
2025-02-05 08:49:17 +00:00
cyy
6293d1446b [2/N] Remove NOLINT suppressions (#146402)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146402
Approved by: https://github.com/soulitzer
2025-02-05 08:38:52 +00:00
Benjamin Glass
e2a029054d cpp_wrapper: enable all CPU repro tests (#145655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145655
Approved by: https://github.com/desertfire
ghstack dependencies: #145095, #145654
2025-02-04 22:05:59 +00:00
Benjamin Glass
9873319a42 cpp_wrapper: fix set_.source_Tensor lowering (#145654)
Adds a C-shim fallback for `set_.source_Tensor`, which is effectively required by `ir.SetSourceTensorKernel`. As a necessary prerequisite to use that IR node, updates `CppWrapperCpu` to handle in-place returns in C-shim ops (the arguments for those returns are silently dropped by `torchgen`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145654
Approved by: https://github.com/desertfire
ghstack dependencies: #145095
2025-02-04 22:05:59 +00:00
Benjamin Glass
7c0fe7a045 cpp_wrapper/aot_inductor: handle conjugation and negation dispatch keys (#145095)
Handles conjugation and negation in the same way that runtime dispatch does: by on-the-fly cloning a tensor with either key applied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145095
Approved by: https://github.com/desertfire
2025-02-04 22:05:58 +00:00
PyTorch MergeBot
106acf0eec Revert "[aoti] Assign proxy call args by name, and support default values. (#146263)"
This reverts commit 11f69808c6.

Reverted https://github.com/pytorch/pytorch/pull/146263 on behalf of https://github.com/atalman due to multiple build failures, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146263#issuecomment-2633828689))
2025-02-04 12:57:55 +00:00
Brian Hirsh
e68f5087d8 update _unsafe_set_version_counter to accept lists of tensors (#137921)
See the comment [here](https://github.com/pytorch/pytorch/issues/132014#issuecomment-2379547400) (cc @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @aakhundov @XilunWu @rec) - this PR updates `_unsafe_set_version_counter` to accept a list of tensors, for overhead-sensitive users (e.g. distributed) who need to hide VC bumps from autograd on a large list of tensors without wanting to suffer the overhead of going from python->C++ separately for every tensor in the list.

I left the binding in pybind, and used a `std::vector`. if we **really** need to optimize overhead even further, we could write a manual cpython binding.

I use this updated API in the next PR to fix FSDP2, so that it properly hides the VC of all `all_gather_buffer` tensors in its call to `split_with_sizes_copy.out(all_gather_buffers)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137921
Approved by: https://github.com/awgu, https://github.com/albanD
2025-02-04 04:51:11 +00:00
Sheng Fu
425aca40a4 Fix random crash in PyPer (#146327)
Summary: PyPer saw random crashes when writing into ET file. This DIFF is to check if the output file is in condition before writing into it, and catch the exception if something bad happens, instead of crashing.

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA

Differential Revision: D69065509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146327
Approved by: https://github.com/sraikund16
2025-02-04 04:50:40 +00:00
angelayi
0c37c332da [export] Additionally save pytree namedtuple field names (#145956)
If a user passes in a namedtuple as an input, currently the input TreeSpec looks like: `TreeSpec(type=namedtuple, context=”class_fqn”, children_spec=[*, *])`

The user then saves the program containing this input TreeSpec. But what happens if they load it in a new environment where `class_fqn` now contains an additional field?

This means that the exported program is now expected to take in another input. But since those fields were not used in the original program, users should be able just drop those additional fields and the program will run successfully. This is needed/used in APS where they use unflattener's adapter to adapt the inputs based on the previously saved treespecs.

There are a couple of [solutions](https://docs.google.com/document/d/1V4ZSdy-8PUISWc8RqvGu3DU01BVegJhHHPWqa1Io7Eg/edit?tab=t.0) for how we can address this, but eventually we settled on saving a side table mapping namedtuple types to their list of field names, which can then be accessed by the adapter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145956
Approved by: https://github.com/zhxchen17
2025-02-04 04:42:30 +00:00
cyy
f397c72697 Remove NOLINTNEXTLINE (#146238)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146238
Approved by: https://github.com/albanD
2025-02-04 02:45:32 +00:00
PyTorch MergeBot
00dc5b10f6 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit 2fd1b6b361.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/atalman due to Breaks executorch tests ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2632202864))
2025-02-03 22:04:28 +00:00
Zhengxu Chen
11f69808c6 [aoti] Assign proxy call args by name, and support default values. (#146263)
Fixing the following issue when compiling the following program:
```
                window = torch.hann_window(N_FFT).to(x.device)
                stft = torch.stft(
                    x, N_FFT, HOP_LENGTH, window=window, return_complex=True
                )
                magnitudes = stft[..., :-1].abs() ** 2
                return magnitudes
```
```
Traceback (most recent call last):
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper
    method(*args, **kwargs)
  File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft
    self.check_model(model, example_inputs)
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model
    actual = AOTIRunnerUtil.run(
             ^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run
    optimized = AOTIRunnerUtil.load(device, so_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load
    return torch._export.aot_load(so_path, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load
    runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device)  # type: ignore[assignment, call-arg]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type for argument 1 but got as_device
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263
Approved by: https://github.com/angelayi
2025-02-03 20:15:59 +00:00
Isuru Fernando
08b14936ae Disable has_relational_guards check for dict_tag optimization for now (#146232)
has_relational_guards evaluates to true almost always, and leads to a
slowdown in guards runtime

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146232
Approved by: https://github.com/anijain2305
2025-02-03 07:56:06 +00:00
Zhengxu Chen
1580f47bf4 [export][ez] Fix generated header file. (#146208)
Summary: as title.

Test Plan: CI

Differential Revision: D68978788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146208
Approved by: https://github.com/yiming0416
2025-02-03 06:01:05 +00:00
Davide Italiano
d80eef7c6d [inductor] Guard a member variable with a define. (#146278)
It's unused otherwise, and when running MPS tests, I get a bunch of warnings of this kind:

/Users/davidino/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:412:10: warning: private field 'blob_size_' is not used [-Wunused-private-field]
  412 |   size_t blob_size_;
      |          ^
1 warning generated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146278
Approved by: https://github.com/Skylion007, https://github.com/jansel
2025-02-03 02:20:08 +00:00
Nikita Shulga
e56dcf2772 [CPUInductor] Fix SVE256 detection (#146207)
This PR removes `torch.cpu._is_arm_sve_supported()` and replaces is with stable `torch.backends.cpu.get_cpu_capability()`

I should have reviewed https://github.com/pytorch/pytorch/pull/134672 more thoroughly, because it introduced duplicate, but slightly different API for detecting CPU architectures, which resulted in runtime crashes on system that do support SVE128, rather than SVE256

Fixes https://github.com/pytorch/pytorch/issues/145441

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146207
Approved by: https://github.com/angelayi
2025-02-01 18:51:34 +00:00
cyy
2fd1b6b361 [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-02-01 12:33:41 +00:00
fduwjj
eb029fba13 [c10d][NCCL] Implement ncclCommInitRankScalable (merging #136789) (#144794)
Try to land https://github.com/pytorch/pytorch/pull/136789/files on our end and fix any remaining issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144794
Approved by: https://github.com/kwen2501, https://github.com/eqy, https://github.com/atalman
2025-01-31 22:39:56 +00:00
William Wen
5b1abdbf5d [dynamo] remove always-failing eval_frame.c debug check (#145982)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145982
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #145981
2025-01-31 20:40:59 +00:00
Zhengxu Chen
aad9f44b2e [export] Sync model container types to schema.py (#145959)
Summary: Synced from D68840230

Test Plan: No behavior changes to existing API. Will be tested internally.

Differential Revision: D68846532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145959
Approved by: https://github.com/yiming0416
2025-01-31 18:17:56 +00:00
PyTorch MergeBot
c3f71eb61b Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit e2917245fb.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally with the same error.  @Chillee or @malfet, can you please help the change get tested? (See D68783351) ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2627886999))
2025-01-31 17:43:09 +00:00
Mikayla Gawarecki
001e355a56 Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)
## Background

This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies  on the previous PR in this stack, where storage order was changed to non lexicographical. A `.format_version` entry was added to the zipfile and `calculate_storage_offsets` will only work on checkpoints with `.format_version`.

When this is turned on, for `torch.load(mmap=True)`, offsets of each storage record (other than the 0th storage will be calculated instead of relying on `miniz` APIs to determine this).

The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases.

6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)

## How does this work

The format for the checkpoint is as such

```
archive_name/
|_ data.pkl
|_.format_version
|_byteorder
|_data/
  |_ 0
  |_ 1
  |_ 2
  |_ ...
|_
```

Each `data/i` record represents a storage, where storages are written in the order that the Pickler encounters them.

For each storage, our `persistent_load` logic saves the following metadata to the pickle file `dtype, numel, key, location` where `numel` is the number of bytes in the storage.

Note that we always use `miniz` writer  in the zip64 mode per [here](7796e308d0/caffe2/serialize/inline_container.cc (L701)) A zipfile record written by miniz looks as such

```
 ---------------- ----------------- ------------------- ---------------- --------- ------------------------------
| 30 byte header | n byte filename | zip64_extra_data | m byte padding | storage | 16 or 24 byte local dir footer  |
 ---------------- ----------------- ------------------- ---------------- --------- ------------------------------
```

- The header size (30) is given by [`MZ_ZIP_LOCAL_DIR_HEADER_SIZE`](https://github.com/pytorch/pytorch/blob/main/third_party/miniz-3.0.2/miniz.c?fbclid=IwZXh0bgNhZW0CMTEAAR2O8Vysd--UoSCxW70gabXIS1dbz733oHwuUQ5_Ff1hY2WU6PL2i6CSH4A_aem_J9oaU2HpDeWtJKOU9EnVqw#L3290)
- filename will be `"{archive_name}/{filepath}"`

- `zip64_extra_data` is determined by [`mz_zip_writer_create_zip64_extra_data`](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6202)). Note that [we only create zip64_extra_data if storage_size >= 0xFFFFFFFF or the offset of the start of the header >= 0xFFFFFFFF](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6519-L6524))
- `m` is determined by [`getPadding`](7796e308d0/caffe2/serialize/inline_container.cc (L254)), which accounts for filename, zip64_extra_data to determine `m` such that the start of `storage` is aligned to 64 bytes. The `m` bytes will always start with `F B padding_size" as the first 4 bytes
- The local dir footer size is determined based on [this snippet ](7796e308d0/third_party/miniz-3.0.2/miniz.c (L6610-L6632)): if the buffer size is 0 it is skipped. If the zip64_extra_data was created, it is 24, otherwise it is 16.

When `torch.utils.serialization.config.load.calculate_storage_offsets` is set we do the following
- We keep track of where the "cursor" is in the file using `current_offset`, after each persistent_load call, it will be at the offset where the header for the next record starts
- for the 0th storage, "data/0", we use the regular get_record_offset to determine the start of the storage
- for any other storage, (where the storages will be in order encountered by the unpickler, 0, 1, 2, 3, ...) we use `get_record_offset_no_read`, which re-uses the `getPadding` logic to determine the offset of the storage
- Note that `load_tensor` will only ever be called again with the same key if the storage's `._data_ptr()` is 0 [[pointer1](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1917-L1918)][[pointer2](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L1936-L1937)], so we cache the offsets for this edge case
- After each storage, if the storage is non-zero, we account for the local dir footer based on the logic described above

## Testing strategy

The agreed upon testing strategy was as follows:
- Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.

Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880
Approved by: https://github.com/albanD
ghstack dependencies: #143879
2025-01-31 17:09:20 +00:00
Bin Bao
2af876707b [AOTI] Fix a memory leak in package boxed_run (#146100)
Summary: AOTIModelPackageLoaderPybind::boxed_run missed a decref when constructing the returned py::list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146100
Approved by: https://github.com/cpuhrsch
2025-01-31 13:32:28 +00:00
Manav Avlani
f9227e7c33 Expose ToIValueAllowNumbersAsTensors to TORCH_PYTHON_API so we can use it in monarch (#146087)
Summary: TSIA

Test Plan: Tested up the stack but existing unittests

Reviewed By: suo

Differential Revision: D68917233

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146087
Approved by: https://github.com/suo
2025-01-31 05:08:11 +00:00
Yifu Wang
c70362fac8 [AsyncMM] re-enable and adapt to cutlass 3.6.0 (#144011)
[D68734067](https://our.internmc.facebook.com/intern/diff/D68734067)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144011
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-01-31 00:48:51 +00:00
Ke Wen
9fdc20809a [PGNCCL] Simplify support macro definition (#145964)
- Promotes usage of `NCCL_VERSION_CODE >= NCCL_VERSION(X, Y, Z)`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145964
Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang
ghstack dependencies: #145893
2025-01-30 23:26:32 +00:00
Eddie Yan
e2917245fb [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee, https://github.com/malfet
2025-01-30 22:33:50 +00:00
Bin Bao
16420a78eb [AOTI] Remove AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 (#146039)
Summary: The AOTI_USE_CREATE_TENSOR_FROM_BLOB_V1 macro was used to solve a FC issue and it can be removed now.

Test Plan: CI

Differential Revision: D68871245

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146039
Approved by: https://github.com/yushangdi, https://github.com/hl475
2025-01-30 19:01:19 +00:00
Ke Wen
51ee9b154e [c10d] Add NCCL memory allocator (#145675)
This PR implements a small UI improvement over #133603.

It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
2025-01-30 18:19:00 +00:00
PyTorch MergeBot
5fa28bbe40 Revert "[c10d] Add NCCL memory allocator (#145675)"
This reverts commit 18a7a04c4a.

Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to Sorry but this still fails internally. See D68866823 for details ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2624900562))
2025-01-30 16:01:52 +00:00
cyy
116af809eb Use std::string_view (#145906)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906
Approved by: https://github.com/albanD
2025-01-30 03:14:27 +00:00
Ke Wen
25ca05eebf [PGNCCL] Correct some ifdef's (#145893)
`create` function supporting `ncclConfig_t` should be wrapped inside `NCCL_HAS_CONFIG` instead of `NCCL_HAS_COMM_NONBLOCKING`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145893
Approved by: https://github.com/c-p-i-o
2025-01-30 01:05:21 +00:00
Ke Wen
18a7a04c4a [c10d] Add NCCL memory allocator (#145675)
This PR implements a small UI improvement over #133603.

It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
2025-01-29 23:20:22 +00:00
bglass@quansight.com
40ccb7a86d cpp_wrapper: Move #includes to per-device header files (#145932)
Summary:
This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time.

Reland https://github.com/pytorch/pytorch/pull/143909 after merge conflicts.

Co-authored-by: Benjamin Glass <[bglass@quansight.com](mailto:bglass@quansight.com)>

Differential Revision: D68656960

Pulled By: benjaminglass1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145932
Approved by: https://github.com/yushangdi, https://github.com/benjaminglass1

Co-authored-by: bglass@quansight.com <bglass@quansight.com>
2025-01-29 21:08:45 +00:00
PyTorch MergeBot
6371c25b91 Revert "[c10d] Add NCCL memory allocator (#145675)"
This reverts commit 9fd6722fc9.

Reverted https://github.com/pytorch/pytorch/pull/145675 on behalf of https://github.com/ZainRizvi due to This fails to build internally, can you please take a look at D68831004 for more details? ([comment](https://github.com/pytorch/pytorch/pull/145675#issuecomment-2622515425))
2025-01-29 18:30:30 +00:00
PyTorch MergeBot
284f217011 Revert "[Environment Variable][7/N] Use thread-safe getenv functions (#140211)"
This reverts commit 97b3b73f3e.

Reverted https://github.com/pytorch/pytorch/pull/140211 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @eqy @ezyang can you please help this get remerged? See D68779772. ([comment](https://github.com/pytorch/pytorch/pull/140211#issuecomment-2622504898))
2025-01-29 18:24:29 +00:00
rpsilva
4abff4b271 Introduce cache clearing APIs for the lazy graph executor (#144489)
This PR introduces two new methods to the LazyGraphExecutor class:

- ClearComputationCache(): Allows clearing the entire computation cache.
- RemoveFromComputationCache(hash): Enables removal of specific cache entries based on their hash.

The main objective is to expose cache management functionality for debugging cache hits and misses across different computations. For instance:
- Reset the cache state in tests, allowing reuse of the same computation client to evaluate cache logic consistently.
- Selectively remove cache entries to analyze the impact on subsequent computations.
- Improve observability into the cache behavior, aiding in the investigation of cache-related issues or optimizations.

On the XLA lazy graph executor, we want to run a series of tests that modify some parts of the HLO module proto of the computation, and we need a means to ensure that the hash is agnostic to some elements (OpMetadata in the XLA proto data). Hence, it would be easy to parameterize the test, clear the cache and validate that the resulting hash is the same between runs. Otherwise, we'd need to hardcode the resulting serialized hash.

Simultaneously, **another motivation**, is that users could also clear some computation hashes for an added flexibility in their applications, by introducing their own custom strategies for maintaining the cache (without relying on the default LRU).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144489
Approved by: https://github.com/wconstab
2025-01-29 17:38:01 +00:00
Ke Wen
9fd6722fc9 [c10d] Add NCCL memory allocator (#145675)
This PR implements a small UI improvement over #133603.

It prepares a NCCL memory allocator in torch cpp and then pybind's it out, so that user can directly use it.

UI:
```
pool = torch.cuda.MemPool(backend.mem_allocator)
with torch.cuda.use_mem_pool(pool):
    tensor = torch.arange(1024 * 1024 * 2, device=device)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145675
Approved by: https://github.com/syed-ahmed, https://github.com/wconstab
2025-01-29 02:48:56 +00:00
fduwjj
4f949f282d [c10d][ez] Remove goto in PGNCCL and make linter happy for PGNCCL and NCCLUtils (#145855)
While working on PGNCCL I found that the code triggers some lint warnings so this PR is to address them or add lint suppressor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145855
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
2025-01-28 21:19:49 +00:00
cyyever
ef28df5c9e [Reland][Environment Variable][4/N] Use thread-safe getenv functions (#140593)
Reland of #137843 , after checking the code again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140593
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-01-28 20:51:49 +00:00
PyTorch MergeBot
3481c2aec4 Revert "[dynamo] save/restore system random state more carefully (#145750)"
This reverts commit e3d3f2b22e.

Reverted https://github.com/pytorch/pytorch/pull/145750 on behalf of https://github.com/eellison due to bisected perf regression ([comment](https://github.com/pytorch/pytorch/pull/145750#issuecomment-2620028414))
2025-01-28 20:51:07 +00:00
Mu-Chu Lee
ac87388e61 [AOTInductor] Refactor CPU and GPU to remove ifdef macros (#145639)
Summary: Remove #ifdef USE_CUDA macros through some refactor

Test Plan: Refactor code, existing tests.

Differential Revision: D68636743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145639
Approved by: https://github.com/desertfire
2025-01-28 16:46:00 +00:00
cyyever
97b3b73f3e [Environment Variable][7/N] Use thread-safe getenv functions (#140211)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140211
Approved by: https://github.com/ezyang, https://github.com/eqy
2025-01-28 15:21:12 +00:00
Chirag Pandya
bdf6dfa17d [chore][ez] change alloc buffer size from 4000 to 4096 (#145759)
Summary:
Allocations typically happen as a power of 2 anyway.
Change the default alloc size to 4096 so eek out a bit more perf.

Test:
unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145759
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
ghstack dependencies: #145756, #145757
2025-01-28 09:14:07 +00:00
PyTorch MergeBot
9010649292 Revert "Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)"
This reverts commit db3685a35c.

Reverted https://github.com/pytorch/pytorch/pull/143880 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but either this PR or the base PR breaks distributed tests ([comment](https://github.com/pytorch/pytorch/pull/143880#issuecomment-2617743403))
2025-01-28 03:07:17 +00:00
Chirag Pandya
78f02bf07c [bug] handle case when remote peer closes connection (#145757)
Summary:
In the case where remote peer closes the connection, nread returns 0. In
this case, we still want to free up the allocated buffer.
Also, reorder the if so that the likely success cases (nread > 0) is at
the top of the function with an early return.

Test Plan:
unit tests

Differential Revision: [D68733192](https://our.internmc.facebook.com/intern/diff/D68733192)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145757
Approved by: https://github.com/XilunWu, https://github.com/fduwjj
ghstack dependencies: #145756
2025-01-28 03:06:38 +00:00
Yifu Wang
db33d23aa8 [SymmetricMemory] fix an issue where rendezvous is performed with wrong device context when torch.cuda.set_device() is not callled (#144886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144886
Approved by: https://github.com/awgu
2025-01-28 01:43:37 +00:00
William Wen
e3d3f2b22e [dynamo] save/restore system random state more carefully (#145750)
Reattempt of https://github.com/pytorch/pytorch/pull/145435 since the state of the linked internal diff appears to be messed up.

Note: I have verified that the previously failing internal tests now pass internally.

Differential Revision: [D68723334](https://our.internmc.facebook.com/intern/diff/D68723334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145750
Approved by: https://github.com/StrongerXi
2025-01-28 01:34:13 +00:00
Mikayla Gawarecki
db3685a35c Add option to serialization config to reduce random reads from get_record_offset when loading with mmap=True (#143880)
## Background

This PR adds `torch.utils.serialization.config.load.calculate_storage_offsets`. This option relies  on the previous PR in this stack, where storage order was changed to non lexicographical. A `.format_version` entry was added to the zipfile and `calculate_storage_offsets` will only work on checkpoints with `.format_version`.

When this is turned on, for `torch.load(mmap=True)`, offsets of each storage record (other than the 0th storage will be calculated instead of relying on `miniz` APIs to determine this).

The existing APIs will issue multiple random reads (reading the end of central directory record, then reading the zipfile header for the record) to determine the storage offset where the record starts. This can greatly degrade `torch.load(mmap=True)` performance for non-filesystem cases.

6aaae9d78f/caffe2/serialize/inline_container.cc (L589-L605)

## Testing strategy

The agreed upon testing strategy was as follows:
- Add debug code gated by an environment flag `TORCH_SERIALIZATION_DEBUG` that will run this offset calculation logic and verify it against getRecordOffset for each storage (when mmap=False)
- This flag is set throughout CI, which means that every time `torch.load` is called, the offset calculation logic is implicitly being tested.

Differential Revision: [D67673026](https://our.internmc.facebook.com/intern/diff/D67673026)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143880
Approved by: https://github.com/albanD
ghstack dependencies: #143879
2025-01-27 23:57:30 +00:00
Chirag Pandya
3ce68dc61e [c10d] Flush file in file recorder (#145458)
Summary:
Flushing file to hopefully prevent file corruptions as reported in
https://github.com/pytorch/pytorch/pull/145125

Test Plan:
Couldn't get file corruption to occur in my tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145458
Approved by: https://github.com/kwen2501
2025-01-27 23:15:52 +00:00
Chirag Pandya
5534c270db [chore] fix new linter (#145756)
Summary:
Fix new linter that's complaining when I made changes to this file:
class 'LibUVStoreDaemon' defines a non-default destructor but does not
define a copy constructor, a copy assignment operator, a move
constructor or a move assignment operator

Test Plan:
make lint passes

Differential Revision: [D68733191](https://our.internmc.facebook.com/intern/diff/D68733191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145756
Approved by: https://github.com/XilunWu, https://github.com/Skylion007, https://github.com/fduwjj
2025-01-27 22:48:12 +00:00
PyTorch MergeBot
c986eba560 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit abf28982a8.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/ZainRizvi due to Sorry but this is failing internally. @Chillee can you please help change get remerged? See  D68720562 ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2616726406))
2025-01-27 19:38:26 +00:00
Eddie Yan
abf28982a8 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-27 18:05:23 +00:00
rzou
ea141d8134 functional compiled autograd (#144707)
This PR squashes together the following commits:

https://github.com/pytorch/pytorch/pull/144115
https://github.com/pytorch/pytorch/pull/143417
https://github.com/pytorch/pytorch/pull/143405
https://github.com/pytorch/pytorch/pull/143387
https://github.com/pytorch/pytorch/pull/143304
https://github.com/pytorch/pytorch/pull/143296

This is a refactor of compiled autograd to use "functional autograd". The end goal is that it gets compiled autograd's initial capture to stop specializing on Tensor metadata, therefore allowing compiled autograd to better handle Tensor subclasses.

For more information, please read the commit messages for each PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144707
Approved by: https://github.com/bdhirsh, https://github.com/xmfan, https://github.com/jansel
2025-01-27 05:20:56 +00:00
wengshiy
73622fc5fa Fix Throughputbenchmark issue (#144669)
Fixes [144461](https://github.com/pytorch/pytorch/issues/144461)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144669
Approved by: https://github.com/leslie-fang-intel, https://github.com/williamwen42, https://github.com/jansel
2025-01-26 03:37:20 +00:00
Sheng Fu
bbe7f53218 Save integral tensor data for ET (#144508)
Summary:
et_replay uses random data to run operators, however, the operators using index tensor to access memory won't work with random data. It usually ran into two exceptions: 1. illegal memory access since index is out of range, it has been fixed with the environment variable ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR_RANGE to record the min/max value of index tensors. 2. unaligned memory access, FBGEMM ops have speical requirements for the memory layout.

To fix the second execption, ENABLE_PYTORCH_EXECUTION_TRACE_SAVE_INTEGRAL_TENSOR is added to allow user to specify the node names, separated by comma, so ET will save the integral tensor data for these nodes. The saved data will be used in et_replay.

Be careful to turn on this option since it will use more space to save the extra data.

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_data_cuda

Differential Revision: D67989856

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144508
Approved by: https://github.com/briancoutinho
2025-01-25 05:38:10 +00:00
Tan Hoang
5bf5ce0e15 Modify enable logic of COLLECTIVE_COMM profiler activity type (#145478)
Summary:
Since `KINETO_NCCL_PROFILER` flag is not used anymore (we are moving from linking the profiler during compile time to loading it dynamically), we change the logic for enabling the profiler to use `TORCH_PROFILER_ENABLE_COLLECTIVE_PROFILING` environment variable for NCCL Collective Communication Profiler.

For HCCL, we still keep the same logic

Test Plan: See  https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/gpu_traces/tree/traces/clientAPI/0/1737579474/devvm29927.cln0/nccl_activities_2387985.json.gz for sample trace on nccl-profiler

Differential Revision: D68515945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145478
Approved by: https://github.com/sraikund16
2025-01-24 22:21:09 +00:00
Yichen Yan
d4171b724e Let tensor_a.new_tensor() be on tensor_a.device by default (#144958)
Fixes #144957
Closes #73838 cc @albanD @ezyang

Currently, `tensor_a.new_tensor()` will return a on-cpu tensor no matter where is `tensor_a`. This differs from the document and is a side-effect of https://github.com/pytorch/pytorch/pull/41984.

See #144957 how current logic breaks dynamo.

This PR restore the documented behavior and add tests for `new_tensor`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144958
Approved by: https://github.com/ezyang
2025-01-24 22:12:31 +00:00
c8ef
a989a0b13a [NFC] Fix some minor typos. (#145599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145599
Approved by: https://github.com/Skylion007
2025-01-24 18:58:59 +00:00
Shuqiang Zhang
c0861d092c [PGNCCL] Add an API to get the status/error code at the PG level (#144498)
Summary:
This PR is basically a replacement of
https://github.com/pytorch/pytorch/pull/140087, which caused some perf
drop due to frequent TCPStore check in watchdog thread. The fix is to move the
tcpstore check in monitoring thread

If unhealthy, the user should be able to get the type of errors, e.g.,
timeout,nccl error or remote error.

This API is applied to PG level, compared to the
work.get_future_result() API which is applied to Work Level.
Error detection at PG level is much more convenient for users to handle
the PG failure as a whole, e.g, restarting the PG.

Error handling at the work level is still useful for users to attach
work specific context and debug the RC of the specific failing
work/collective

Note it is critical for all ranks in the PG to be notified about an
error as soon as it occurs, so we introduce an errorType of
REMOTE_ERROR, which is 'broadcasted' from a src rank (which detects a
local error) to all other ranks in the PG, the broadcast is done through
TCPStore currently

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144498
Approved by: https://github.com/kwen2501
2025-01-24 16:47:32 +00:00
cyy
6a35d9aaa4 Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806
Approved by: https://github.com/kwen2501
2025-01-24 12:22:13 +00:00
Benjamin Glass
d5629889f1 cpp_wrapper: Properly handle scalars when input to tensor arguments (#144910)
Additionally, reduce code duplication in `cpp_wrapper_cpu_array_ref.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144910
Approved by: https://github.com/desertfire
2025-01-24 02:06:35 +00:00
PyTorch MergeBot
6a2b4db0a1 Revert "Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)"
This reverts commit 42f4fda2eb.

Reverted https://github.com/pytorch/pytorch/pull/143806 on behalf of https://github.com/huydhn due to Lots of builds fail after this land, so maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/143806#issuecomment-2611275836))
2025-01-24 00:17:34 +00:00
Chirag Pandya
f8a4f16634 [c10d] fix memory leak on shutdown (#145507)
Summary:
Fix memory leak on shutdown when socket is closed.
We still need to free the buffer to make valgrind happy.

Test Plan:
Use `mtiavm`.
Repro steps provided by cristianlume.

on window 1:
```
vm ssh --vm=0 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2
```
on window 2:
```
vm ssh --vm=1 -- $(buck run @//neteng/ai/rdma_gen/mode/owl //neteng/ai/rdma_gen:rdma_gen --emit-shell) --rdma_mode=mtiav1 --num_ranks=2 --rank=1 --store_host=172.16.1.1
```

without the fix:
```
==8766==ERROR: LeakSanitizer: detected memory leaks
```
With fix, no leak

Differential Revision: D68566104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145507
Approved by: https://github.com/XilunWu, https://github.com/d4l3k
2025-01-23 23:36:15 +00:00
PyTorch MergeBot
6dd8283381 Revert "[compiled autograd] Proxy opaque nodes for built-in autograd nodes (#143296)"
This reverts commit 5531fafffe.

Reverted https://github.com/pytorch/pytorch/pull/143296 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
PyTorch MergeBot
c3fadacf84 Revert "[compiled autograd] Proxy a node for CopyBackwards into the graph (#143304)"
This reverts commit 8c7c5f7bfc.

Reverted https://github.com/pytorch/pytorch/pull/143304 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
PyTorch MergeBot
9553301ade Revert "[compiled autograd] Proxy nodes for user-defined C++ torch::autograd::Function (#143387)"
This reverts commit 784bb2127c.

Reverted https://github.com/pytorch/pytorch/pull/143387 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
PyTorch MergeBot
16c4f8c395 Revert "[compiled autograd] Always proxy autograd.Function nodes; handle AOT backwards (#143405)"
This reverts commit ec820fe57c.

Reverted https://github.com/pytorch/pytorch/pull/143405 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:13 +00:00
PyTorch MergeBot
3f6cfd0156 Revert "[compiled autograd] stop specializing on metadata during initial trace (#143417)"
This reverts commit 99dd1bf1b9.

Reverted https://github.com/pytorch/pytorch/pull/143417 on behalf of https://github.com/izaitsevfb due to breaking internal tests T213390054 ([comment](https://github.com/pytorch/pytorch/pull/143296#issuecomment-2611224926))
2025-01-23 23:34:12 +00:00
PyTorch MergeBot
bf62222d81 Revert "[compiled_autograd] Rename interface to pyinterface (#145495)"
This reverts commit e1407f5aeb.

Reverted https://github.com/pytorch/pytorch/pull/145495 on behalf of https://github.com/izaitsevfb due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145495#issuecomment-2611194932))
2025-01-23 23:07:17 +00:00
PyTorch MergeBot
dad9bc3461 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit de945d78da.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/izaitsevfb due to unused variables again :( ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2611182461))
2025-01-23 22:59:25 +00:00
cyy
42f4fda2eb Enable clang-tidy on torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (#143806)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143806
Approved by: https://github.com/kwen2501
2025-01-23 22:47:18 +00:00
Richard Zou
e1407f5aeb [compiled_autograd] Rename interface to pyinterface (#145495)
Summary: interface is a reserved word in some MSVC variants.

Test Plan: build

Differential Revision: D68561379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145495
Approved by: https://github.com/xmfan
2025-01-23 21:40:59 +00:00
Nikhil Gupta
41b38f755c Revert "Reverting the PR adding Kleidiai-based int4 kernels (#145392)" (#145505)
https://github.com/pytorch/pytorch/pull/134124 was reverted by https://github.com/pytorch/pytorch/pull/145392 due to KleidiAI clone issue.

1. This reverts commit 0940eb6d44 (https://github.com/pytorch/pytorch/pull/145392 )and Fixes KleidiAI mirror issue.
2. KleidiAI is now cloned from github mirror instead of arm gitlab

Change-Id: I7d6eee7214cd117d3057d615936fcc3ee6052fa2

Fixes https://github.com/pytorch/pytorch/issues/145273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145505
Approved by: https://github.com/malfet
2025-01-23 18:50:59 +00:00
Animesh Jain
015c6d6fdb [dynamo][guards] Turn on profiling of guard manager (#145420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145420
Approved by: https://github.com/ezyang
ghstack dependencies: #145351
2025-01-23 18:17:43 +00:00
Bartlomiej Stemborowski
3a8d3785f7 [ca][bug_fix] Fix ref counting of objects in the set_autograd_compiler function. (#145482)
PR#141153 exposed the option to collect sizes as dynamic. After this
change, the function set_autograd_compiler returns PyTuple object which
is populated using PyTuple_SET_ITEM function. Yet, that function steals
reference to the object and doesn't INCREF it. So currently we are
missing INCREF on prior_compiler when it is Py_None and INCREF on
prior_dynamic which is either Py_False or Py_True. This bug may lead to
the possible memory corruption.

@xmfan @jansel @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145482
Approved by: https://github.com/albanD, https://github.com/jansel
2025-01-23 17:13:56 +00:00
cyy
29f52e3972 [2/N] Remove unnecessary once flag usage (#145057)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145057
Approved by: https://github.com/albanD
2025-01-23 09:48:46 +00:00
Shunting Zhang
d3f196909d [inductor] let inplace-padding support cpp-wrapper (#145325)
Some context: Inplace padding is an optimization to do padding in place. E.g., if a tensor has size [2048, 2047] and stride [2048, 1]. When we need pad one extra element to the end of each row (e.g. during mm padding), we can just reuse the original tensor and do the padding inplace. This saves memory and bandwidth.  One caveat for this optimization is, PyTorch does not allocate 2048 elements for the last row of the original tensor. It only allocate 2047 elements. So assuming the last row having enough space for 2048 elements may be wrong and cause OOB memory access (although I never see this happen maybe due to overallocation in the CUDACachingAllocation, this should better be fixed).

The fix is when we allocate the tensor, instead of doing something like:
```
  buf0 = randn_strided([2048, 2047], [2048, 1])
```
we do some small overallocation
```
  buf0 = randn_strided([2048, 2048], [2048, 1]).as_strided([2048, 2047], [2048, 1])
```

cpp_wrapper needs special handling since memory allocation goes thru different code path to python wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145325
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #140249
2025-01-23 09:22:38 +00:00
Chirag Pandya
5e6451ea78 [c10] catch c10 error and log message (#145413)
Summary:
Explicitly catch c10 error and log the error message only.

The standard exception `e.what()` below ends up logging the stack trace that is confusing users.
See S477887 for details.

Test Plan:
tested locally.
```
buck test caffe2/test/cpp/c10d:TCPStoreTest
buck2 daemon constraint mismatch: Version mismatch; killing daemon...
Starting new buck2 daemon...
Connected to new buck2 daemon.
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Watchman fresh instance: new mergebase, cleared graph state, cleared dep files
Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`.
Buck UI: https://www.internalfb.com/buck2/dbd34fa4-50ed-4eeb-800d-688f5a7bec68
Test UI: https://www.internalfb.com/intern/testinfra/testrun/281475375994918
Network: Up: 1.5GiB  Down: 4.7GiB  (reSessionID-d6b0568e-2347-4375-a2d9-2d03ca0c2161)
Loading targets.   Remaining      0/3024                                                                                                                                 69199 dirs read, 687558 targets declared
Analyzing targets. Remaining      0/31483                                                                                                                                1481904 actions, 1719048 artifacts declared
Executing actions. Remaining      0/250391                                                                                                                               77:11:29.7s exec time total
Command: test.     Finished 2031 local, 45445 remote, 51473 cache (52% hit)                                                                                              20:16:36.9s exec time cached (26%)
Time elapsed: 7:32.7s
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D68516080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145413
Approved by: https://github.com/fduwjj
2025-01-23 03:45:47 +00:00
Eddie Yan
de945d78da [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-22 22:42:48 +00:00
rzou
99dd1bf1b9 [compiled autograd] stop specializing on metadata during initial trace (#143417)
The previous PRs built up to this. We change compiled autograd's initial
trace to stop baking in metadata.

While tracing, we allocate some weirdly shaped tensors that we can put
proxies on. The initial trace should not be accessing any metadata of
these tensors (it will likely error out if it does because of how weird
the shapes are).

This involved fixing some various sites where we do specialize on the
metadata, like:
- we change CopySlices's apply_with_saved to proxy some calls
  into the graph (this change is fairly hard to split out by itself).
- we stop calling InputBuffer::add
- we delete the weird metadata from the graph so that no graph passes
  can make use of it.

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143417
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304, #143387, #143405
2025-01-22 21:51:07 +00:00
rzou
ec820fe57c [compiled autograd] Always proxy autograd.Function nodes; handle AOT backwards (#143405)
We will always proxy autograd.Function nodes in compiled autograd's
initial graph capture (previously there was an
option to proxy vs trace into the autograd.Function)

We have some requirements for the AOTBackward. Compiled Autograd runs
accumulate grad reordering passes on the AOTBackward graph directly
after the initial graph capture, so we can't just proxy a single node for it.

Instead, we:
- proxy the AOTBackward prologue function into the CA graph
- copy-paste the AOTBackward graph into the CA graph
- trace directly through the epilogue (the traced nodes go into the CA
  graph).

Tracing through the epilogue is safe (assuming no Tensor subclasses)
because the only thing the epilogue does is drop some outputs. The
Tensor subclass situation was already broken so this doesn't regress
anything but this PR sets it up to be fixed (in a followup, where we
will proxy "make_subclass" calls into the graph from the epilogue).

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143405
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304, #143387
2025-01-22 21:50:56 +00:00
rzou
784bb2127c [compiled autograd] Proxy nodes for user-defined C++ torch::autograd::Function (#143387)
We define a functional version of a C++ torch::autograd::Function. The
functional version reconstructs the ctx object and then calls
backward with it.

Some more details:
- we define how to pack/unpack ctx.saved_data into an IValue. It's a
  Dict[str, IValue], so it wasn't difficult.
- every call to CppNode::apply_with_saved binds a new function to
  Python. This is because we're unable to reuse the a previously bound
  function for reasons (the schema may change depending on what the user
  actually puts into their Dict[str, IValue]).

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143387
Approved by: https://github.com/jansel, https://github.com/xmfan
ghstack dependencies: #143296, #143304
2025-01-22 21:50:47 +00:00
rzou
8c7c5f7bfc [compiled autograd] Proxy a node for CopyBackwards into the graph (#143304)
CopyBackwards is a manual C++ torch::autograd::Node; we update its
apply_with_saved to proxy a functional version of it into the graph instead
of inlining into it.

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143304
Approved by: https://github.com/xmfan, https://github.com/jansel
ghstack dependencies: #143296
2025-01-22 21:50:37 +00:00
rzou
5531fafffe [compiled autograd] Proxy opaque nodes for built-in autograd nodes (#143296)
This PR is on the way to getting compiled autograd's initial capture to
stop specializing on Tensor metadata.

This PR changes compiled autograd's initial capture to proxy an opaque
(w.r.t. Dynamo) function into the graph for all built-in codegen'ed
autograd nodes and validate_outputs.

We changed each codegen'ed apply_with_saved (e.g.
MulBackward0::apply_with_saved) to call into Python to proxy a function
(compiled_autograd.ops.MulBackward0) into the graph. Then, we use the
node's InputMetadata to "guess" at the properties of the output Tensors
to create some new FakeTensors.

Some details:
- MulBackward0::apply_with_saved lives in libtorch_cpu, but needs to be
  call to Python via libtorch_python. There is an indirection
  (PyCompilerInterface) to do this.
- MulBackward0::apply_with_saved passes a C++ function to Python. To make
  our lives easier, every codegen'ed apply_with_saved passes a C++
  function with the same signature
  `(variable_list, ivalue_list) -> variable_list`.
- We define how to pack arbitrary C++ types into IValue via a helper
  IValuePacker struct and codegen functional variants of each builtin
  C++ autograd node (e.g. MulBackward0_apply_functional_ivalue).

MulBackward0 before this PR:
https://gist.github.com/zou3519/a80381d5fa38e970e413fcd91b0530de

MulBackward0 after this PR:
https://gist.github.com/zou3519/0c2eee8b3d8d96232b51ef430b53c5b0

Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143296
Approved by: https://github.com/jansel
2025-01-22 21:50:29 +00:00
albanD
0940eb6d44 Reverting the PR adding Kleidiai-based int4 kernels (#145392)
Mitigation for https://github.com/pytorch/pytorch/issues/145273
Reverting https://github.com/pytorch/pytorch/pull/134124 and https://github.com/pytorch/pytorch/pull/144074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145392
Approved by: https://github.com/ZainRizvi, https://github.com/malfet, https://github.com/atalman, https://github.com/digantdesai
2025-01-22 20:11:49 +00:00
Hyunho Yeo
057d9aff39 [S481486] [MTIA] Correct mtia.device_count() API (#145338)
Summary:
Prev: Count the number of "general" accelerators

Curr: Count the number of MTIA devices by using the MTIA runtime API

Test Plan:
```
buck test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r  test_get_device_count
```

https://www.internalfb.com/intern/testinfra/testrun/8162774572631995

Reviewed By: BoyueZheng

Differential Revision: D68472668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145338
Approved by: https://github.com/BoyueZheng, https://github.com/egienvalue
2025-01-22 17:45:15 +00:00
Isuru Fernando
0efa843392 Dynamic shape guards in C++ (#139899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139899
Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/jansel
ghstack dependencies: #143385, #143164
2025-01-22 14:58:35 +00:00
Edward Z. Yang
b3e90c8c33 Add support for torch function on dtype arguments (#145085)
Along the lines of https://github.com/pytorch/pytorch/issues/119194 although it doesn't actually address the FCD case.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145085
Approved by: https://github.com/vmoens, https://github.com/Skylion007
2025-01-21 17:44:47 +00:00
Joel Schlosser
3ee531f8b9 Support NJT chunk() backward on batch dim (#144584)
Part of my BE project addressing NJT bugs surfaced via OpInfo tests.

Implements `chunk()` backward on the batch dim, which was left out before. This PR unbinds the components and invokes `copy_()` on these to pass along the appropriate gradients.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144584
Approved by: https://github.com/soulitzer
ghstack dependencies: #144582, #144583
2025-01-18 15:58:24 +00:00
Will Constable
2859b11bdb [pytorch/ncclx] Remove Alltoallv specialization for PTD all_to_all (#145045)
Summary:
PTD all_to_all uses a list of tensors, while ncclAllToAllv (provided
by NCCLX and RCCL) assumes that a single contiguous buffer is used.
These are fundamentally mismatched.  The list of tensors might not be
contiguous or even ordered (buffer addresses might not be in
increasing order).

This patch removes the ncclAllToAllv specialization for PTD
all_to_all, and instead let's it directly call ncclSend/ncclRecv.

Co-authored by @pavanbalaji
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145045
Approved by: https://github.com/pavanbalaji, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/ezyang
2025-01-18 05:26:55 +00:00
Yanbo Liang
5d02575aa1 [Trace Python dispatcher] Support torch.DispatchKey & torch.DispatchKeySet (#144439)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144439
Approved by: https://github.com/zou3519
2025-01-17 02:26:36 +00:00
PyTorch MergeBot
6c713ccb5e Revert "Make functionalization ViewMeta serializable with pickle. (#143712)"
This reverts commit b8abdaa286.

Reverted https://github.com/pytorch/pytorch/pull/143712 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/143712#issuecomment-2597205261))
2025-01-17 00:52:50 +00:00
PyTorch MergeBot
94c0f15302 Revert "cpp_wrapper: Move #includes to per-device header files (#143909)"
This reverts commit d62b3979da.

Reverted https://github.com/pytorch/pytorch/pull/143909 on behalf of https://github.com/kit1980 due to breaking internal builds because of removal of torch‎/_inductor‎/codegen‎/aoti_runtime‎/implementation.cpp‎ ([comment](https://github.com/pytorch/pytorch/pull/143909#issuecomment-2597188669))
2025-01-17 00:36:38 +00:00
cyy
2ea394ba29 Modernize C++ code (#144603)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144603
Approved by: https://github.com/malfet
2025-01-17 00:25:18 +00:00
PyTorch MergeBot
4ea189422d Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit a6763b7b81.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2596895865))
2025-01-16 21:12:41 +00:00
garfield1997
3a5bf0bc36 expose extra torch_python apis (#144746)
Fixes #144302
After checking the code of my third-party devices, I think these APIs are also relied on by us, so I exposed them according to the discussion in the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144746
Approved by: https://github.com/albanD
2025-01-16 20:50:31 +00:00
Yukio Siraichi
b8abdaa286 Make functionalization ViewMeta serializable with pickle. (#143712)
Fix: #141974

This PR makes `ViewMeta` sequence, present in functional tensors,
serializable with pickle. In order to accomplish that, it makes
`ViewMeta` an abstract class with overridable `forward` and `reverse`
functions. In this context, each operation that once instanciated
`ViewMeta`, should now create a new specialized class that inherits from
`ViewMeta. Therefore, this PR also uses codegen for creating these
specializations.

In summary, these are the changes this PR introduces:

- `ViewMeta` is turned into an abstract class (see
  _FunctionalStorageImpl.cpp_). `forward` and `reverse` are pure virtual
  functions that need to be implemented. `to_out_index` should be
  implemented by operations that might return more than 1 output.

- New `ViewMeta` specializations for `resize_` and `_unsafe_view` are
  created (see _FunctionalizeFallbackKernel.h_).

- New templates _ViewMetaClasses.{cpp,h}_ are created. They hold the
  declaration and definition of the `ViewMeta` specializations, which
  are automatically generated in the ATen codegen (see _gen.py_).

- New `_functionalization` Python sub-module is created (see
  _Module.cpp_). It serves as namespace for the `ViewMeta`
  specializations and `InverseReturnMode` enum.

- New template _ViewMetaClassesPythonBinding.cpp_ is created. It holds
  the automatically generated Python bindings for the `ViewMeta`
  specialization, which are generated in the torch codegen (see
  _generate_code.py_).

Note that this PR makes use of codegen at 2 different moments:

- ATen codegen (_gen.py_): generates the `ViewMeta` specialized classes.
- Torch codegen (_generate_code.py_): generated the Python bindings for
  them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143712
Approved by: https://github.com/bdhirsh
2025-01-16 19:41:41 +00:00
Tom Ritchford
13d35ea67a [BE] Add missing throw of std::runtime_error in scrc/cuda/utils.cpp (#144962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144962
Approved by: https://github.com/amjames, https://github.com/Skylion007, https://github.com/malfet
2025-01-16 17:35:39 +00:00
cyy
843627b7b1 Remove unnecessary once flag usage (#143255)
Static variables in C++11 is guaranteed to be initialised exactly once, as mentioned [here](https://en.cppreference.com/w/cpp/language/storage_duration)
```
If multiple threads attempt to initialize the same static local variable concurrently,
the initialization occurs exactly once
(similar behavior can be obtained for arbitrary functions with std::call_once.
Usual implementations of this feature use variants
of the double-checked locking pattern,
which reduces runtime overhead for already-initialized local statics
 to a single non-atomic boolean comparison.
```
Given that static c10::once_flag is used before, why not just use the associated function to initialised the related static variables? That is the motivation behind this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143255
Approved by: https://github.com/albanD
2025-01-16 02:36:11 +00:00
Brian Hirsh
4831f89790 support numbers as tensors for aten.copy(Tensor, Tensor) (#141161)
Fixes https://github.com/pytorch/pytorch/issues/141149. `aten.copy_` supports numbers as tensors in the python arg parser. So we need to give the same treatment to `aten.copy`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141161
Approved by: https://github.com/ezyang
2025-01-16 00:08:25 +00:00
Xu Han
2645fc45b1 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-01-15 23:43:41 +00:00
Benjamin Glass
d62b3979da cpp_wrapper: Move #includes to per-device header files (#143909)
This prepares us for the next PR in the stack, where we introduce pre-compiled per-device header files to save compilation time.

Differential Revision: [D67938955](https://our.internmc.facebook.com/intern/diff/D67938955)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143909
Approved by: https://github.com/desertfire
2025-01-15 21:14:02 +00:00
Zhengxu Chen
834086c023 [export] Load side info about pos/kw argument kind for serialization. (#144686)
Summary:
Fixing issue of nodes like
```
torch.ops.aten.linear.default(x, w, b)
```
being deserialized as
```
torch.ops.aten.linear.default(x, w, bias=b)
```
which breaks roundtripping.

Test Plan:
buck test mode/opt caffe2/test:test_export -- -r TestDeserialize
buck test mode/opt caffe2/test:test_export -- -r TestSerialize

Differential Revision: D67991410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144686
Approved by: https://github.com/angelayi
2025-01-15 19:08:38 +00:00
eqy
a6763b7b81 [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-15 18:37:55 +00:00
dilililiwhy
7c52c97a65 Expose several APIs to public (torch python APIs) (#144525)
Fixes #144302
Try to expose several APIs to public for privateuse1 scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144525
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-01-15 14:34:45 +00:00
Taher
d9d7cca009 make eval_frame safe (#141357)
Fixes #108942

this PR converts eval_frame.c's static extension types to heap types, making it thread and sub-interpreter safe.

the current modification only showcases one state variable being lifted, but there are opportunities for other variables that can be addressed in this PR

todo / suggestions:

1. uplift `eval_frame_callback_key` to module state
2. define `.m_slots` to module definition so initialization is within python's module lifecycle rather than an explicit `torch_c_dynamo_eval_frame_init`
3. define configurations for module allowing sub-interpreters or not

```c
static int module_exec(PyObject *m) {}

static PyModuleDef_Slot module_slots[] = {
    {Py_mod_exec, module_exec},
    {0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT,
     ....
    .m_slots = module_slots
};
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141357
Approved by: https://github.com/jansel

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2025-01-15 07:37:50 +00:00
fduwjj
ae7df51232 [c10d] Fix CudaEventCache for dangling references (#144496)
Reported in https://github.com/pytorch/pytorch/issues/143470, we have a dangling references in `CudaEventCache`. So we want to fix it.
1. We add a unit test to repro the issue mentioned in the issue.
2. Instead of converting variables to shared pointers as suggested in the issue, we then make the cache itself a shared pointer. So if the thread creates the cache dies before all events get recycled, the cache is still there until the last CudaEvent get deleted. (thanks for the suggestion from @kwen2501 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144496
Approved by: https://github.com/kwen2501
2025-01-15 05:11:48 +00:00
fduwjj
e0bbff6019 [c10d][ez] Add comments to the end of Macro for better readability (#144789)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144789
Approved by: https://github.com/c-p-i-o
2025-01-15 05:06:41 +00:00
Bin Bao
2683691237 [AOTI] Add a boxed_run API (#142213)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141696. Add a new C++ runner API (boxed_run) following dynamo's boxed calling convention, which steals tensors' ownership from the input tensor list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142213
Approved by: https://github.com/ezyang
2025-01-14 18:47:42 +00:00
Zhenbin Lin
cbb1ed2966 [1/N] OpenReg: Replace open_registration_extension.cpp with openreg (#141815)
As described in OpenReg [next-steps](https://github.com/pytorch/pytorch/blob/main/test/cpp_extensions/open_registration_extension/README.md#next-steps), here we replace the current `open_registration_extension.cpp` test in PyTorch CI with openreg.

The current `open_registration_extension.cpp` contains two parts:
1. Implentations to support `PrivateUse1` backend.
2. Helper functions used for UTs in `test_cpp_extensions_open_device_registration.py` and `test_transformers.py`.

For the first part, we'll replace it with openreg. For the second part, we'll migrate them to ut files step by step.

@albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141815
Approved by: https://github.com/albanD
2025-01-14 15:59:00 +00:00
lzhang2
1800f5f461 Enable coalescing path on XPU and dispatch to XPU tensor barrier if XCCL backend is specified. (#143735)
**Motivation:**

- Enable coalescing path on XPU for `batch_isend_irecv`.
- If XCCL backend is specified, then construct a XPU tensor to ensure `barrier` dispatch to XCCL backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143735
Approved by: https://github.com/kwen2501
2025-01-14 08:37:48 +00:00
Daulet Askarov
21cbee5d9b Drop unused num_elements variable (#144723)
Summary:
With the recent enforcement of unused variable as an error in D67329035, certain tests like
https://www.internalfb.com/intern/test/562950135258426?ref_report_id=0
can't build citing:
```
Action failed: fbcode//caffe2:libtorch_cuda (cfg:linux-x86_64-fbcode-platform010-clang17-no-san#2a7259832b2f5c67) (cxx_compile torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (pic))
Remote command returned non-zero exit code 1
Remote action, reproduce with: `frecli cas download-action a95a6625d2b071a782a7a8ea2882f4adccf103b023df5ccb596f48c506101754:145`
Stdout: <empty>
Stderr:
fbcode/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:3757:16: error: unused variable 'num_elements' [-Werror,-Wunused-variable]
 3757 |         size_t num_elements = output.numel();
      |                ^~~~~~~~~~~~
1 error generated.
```
This causes Sandcastle to turn off these tests, decreasing protection from other bad diffs. Clean up the unused variable to unblock.

Test Plan:
```
buck2 build --config hpc_comms.use_ncclx=dev --flagfile fbcode//mode/opt fbcode//ftar:ftar_py_e2e_test
```

https://www.internalfb.com/buck2/888dfc68-07eb-4ba1-add5-b38c12d52b33

Reviewed By: c-p-i-o

Differential Revision: D68126236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144723
Approved by: https://github.com/fduwjj, https://github.com/Skylion007

Co-authored-by: Daulet Askarov <dauleta@meta.com>
2025-01-14 08:29:01 +00:00
PyTorch MergeBot
64bcf39180 Revert "[CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)"
This reverts commit 388b75edec.

Reverted https://github.com/pytorch/pytorch/pull/144441 on behalf of https://github.com/kit1980 due to breaking internal builds: unused variable 'halpha' ([comment](https://github.com/pytorch/pytorch/pull/144441#issuecomment-2588517060))
2025-01-14 00:48:28 +00:00
PyTorch MergeBot
4f74864c94 Revert "[AOTI] Add a boxed_run API (#142213)"
This reverts commit 868984c3e3.

Reverted https://github.com/pytorch/pytorch/pull/142213 on behalf of https://github.com/kit1980 due to breaking lots of internal builds, see D68036023 ([comment](https://github.com/pytorch/pytorch/pull/142213#issuecomment-2588378262))
2025-01-13 22:43:47 +00:00
Bin Bao
684d015c2f [AOTI] Support _int_mm (#144571)
Summary: Add _int_mm to the C shim, to resolve a torchao issue, https://github.com/pytorch/ao/pull/1531#issue-2776827015

Differential Revision: [D68030385](https://our.internmc.facebook.com/intern/diff/D68030385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144571
Approved by: https://github.com/yushangdi
2025-01-13 20:32:29 +00:00
Yiming Zhou
87843ee9ab [export] Unify single and multiple return for hops (#143227)
Summary: Introduce `is_hop_single_tensor_return` field to the `Node` class in serialization so that during deserialization when there is a single return, we know whether it is a tuple of a single element or a single element.

Test Plan:
```
buck2 run @mode/dev-nosan sigmoid/inference/test:e2e_test_cpu -- -r E2ETestCPUCond
buck2 run @mode/dev-nosan sigmoid/inference/test:test_passes -- -r test_const_folding2
```

Differential Revision: D66991624

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143227
Approved by: https://github.com/zhxchen17
2025-01-13 03:31:14 +00:00
Nikita Shulga
9ae35b8bb1 [BE] Introduce c10::SyntaxError (#144647)
Which will be translated into Python's SyntaxError
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144647
Approved by: https://github.com/Skylion007
2025-01-12 23:23:54 +00:00
eqy
388b75edec [CUDA][cuBLAS] Add fp16 accumulate option to cuBLAS/cuBLASLt (#144441)
Test for `cublasGemmEx` added, still need to figure out the best way to exercise the other APIs...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144441
Approved by: https://github.com/Chillee
2025-01-11 15:30:38 +00:00
Nikita Shulga
92ddb3d3d3 [MPS] Expose MPSProfiler::start/stopCapture to Python (#144561)
I.e. when `MTL_CAPTURE_ENABLED` environment variable is set to 1, one should be able to invoke wrap the code with `torch.mps.profiler.capture_metal` to generate gputrace for shaders invoked inside the context manager.

For example, code below:
```python
import torch
import os

def foo(x):
   return x[:,::2].sin() + x[:, 1::2].cos()

if __name__ == "__main__":
    os.environ["MTL_CAPTURE_ENABLED"] = "1"
    x = torch.rand(32, 1024, device="mps")

    with torch.mps.profiler.metal_capture("compiled_shader"):
        torch.compile(foo)(x)
```
should capture the execution of a `torch.compile` generated shader
<img width="734" alt="image" src="https://github.com/user-attachments/assets/718ff64e-103b-4b11-b66c-c89cfc770b5d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144561
Approved by: https://github.com/manuelcandales
ghstack dependencies: #144559, #144560
2025-01-11 02:05:36 +00:00
Feny Patel
1ce3524277 use collective_comm activity for hccl traces (#144490)
Summary: Use existing collective_comm (currently used for nccl traces) for hccl traces as well. Only init the nccl profiler when KINETO_HAS_NCCL_PROFILER is defined so as to not init it when the build is for MTIA/HCCL

Test Plan: CIs

Differential Revision: D67285333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144490
Approved by: https://github.com/sraikund16
2025-01-10 18:39:35 +00:00
Bin Bao
868984c3e3 [AOTI] Add a boxed_run API (#142213)
Summary: Fixes https://github.com/pytorch/pytorch/issues/141696. Add a new C++ runner API (boxed_run) following dynamo's boxed calling convention, which steals tensors' ownership from the input tensor list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142213
Approved by: https://github.com/ezyang
2025-01-10 18:27:00 +00:00
cyy
9a841f9321 Enable bugprone-unchecked-optional-access (#144226)
We can actually enable bugprone-unchecked-optional-access without the risk of hang.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144226
Approved by: https://github.com/albanD
2025-01-10 03:16:56 +00:00
Yu, Guangye
6de110b862 Support with statement on torch.Stream (#140138)
# Motivation
We propose to support Python with statement on `torch.Stream`. This is a benefit for all accelerators when writing device-agnostic code. The device-specific stream will also be supported because they are generally derived from `torch.Stream`.

With this PR, we can do like this
```python
s1= torch.Stream()
# Set s1 to the current stream
torch.accelerator.set_stream(s1)
with torch.Stream() as s2:
    # Inside with statement, we set s2 to the current stream
    assert torch.accelerator.current_stream() == s2
# Here the current stream should be s1
assert torch.accelerator.current_stream() == s1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140138
Approved by: https://github.com/albanD
2025-01-10 02:05:19 +00:00
Richard Barnes
3e7e435bb1 [codemod] Remove unused-variable in caffe2/aten/src/ATen/native/quantized/cpu/fbgemm_utils.cpp +2 (#144371)
Summary:
LLVM-15 has a warning `-Wunused-variable` which we treat as an error because it's so often diagnostic of a code issue. Unused variables can compromise readability or, worse, performance.

This diff either (a) removes an unused variable and, possibly, it's associated code or (b) qualifies the variable with `[[maybe_unused]]`.

 - If you approve of this diff, please use the "Accept & Ship" button :-)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144371
Approved by: https://github.com/Skylion007
2025-01-09 21:49:17 +00:00
Aaron Gokaslan
bbec35f028 [BE]: Replace clone detach with detach clone to be more efficient (#144469)
Follow up to #144270 and fix some vulkan code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144469
Approved by: https://github.com/awgu
2025-01-09 18:28:39 +00:00
cyy
b0be30dd79 [19/N] Fix extra warnings brought by clang-tidy-17 (#144448)
Apply more clang-tidy fixes. There was a bug introduced by #144014 due to incorrect namespace concatenation which is reverted here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144448
Approved by: https://github.com/albanD
2025-01-09 15:58:05 +00:00
cyy
d0070ca07e [18/N] Fix extra warnings brought by clang-tidy-17 (#144014)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144014
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-01-08 17:21:55 +00:00
Richard Barnes
3beb7006dd c10::optional -> std::optional in a few places (#144340)
Test Plan: Sandcastle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144340
Approved by: https://github.com/malfet
2025-01-07 21:09:39 +00:00
Aaron Gokaslan
e4a05dec0f [BE][Ez]: Fix docs recommending inefficient tensor op order (#144270)
`detach().clone()` is faster than `.clone().detatch()` since the gradients are not cloned. Let's update all the documentation and tests so that users do not use the inefficient op ordering.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144270
Approved by: https://github.com/awgu, https://github.com/XuehaiPan
2025-01-07 17:31:32 +00:00
PyTorch MergeBot
778d953951 Revert "[AsyncMM] re-enable and prepare for cutlass 3.5.1 update (#144011)"
This reverts commit 24ac87392b.

Reverted https://github.com/pytorch/pytorch/pull/144011 on behalf of https://github.com/malfet due to Not sure what is going on, but lots of builds are failing ([comment](https://github.com/pytorch/pytorch/pull/144011#issuecomment-2574317669))
2025-01-07 03:24:01 +00:00
Yifu Wang
24ac87392b [AsyncMM] re-enable and prepare for cutlass 3.5.1 update (#144011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144011
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2025-01-07 02:15:42 +00:00
PyTorch MergeBot
aa14fcd96c Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit e141cb9c34.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/clee2000 due to still failing internally D67556174, see D67866123 for link to error ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2573652459))
2025-01-06 18:15:52 +00:00
Richard Barnes
e458b39fc4 c10::string_view -> std::string_view in Device.cpp (#144178)
Test Plan: Sandcastle

Differential Revision: D67817163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144178
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-01-04 18:51:33 +00:00
Dingming Wu
a881954b0c [PTD] Dump rcclexp proxy trace in pytorch (#143678)
Summary:
Dump the active proxyOp status per rank and per communicator when WatchDog timeout or aborts.

Added
`#if defined(USE_ROCM) && defined(NCCL_COMM_DUMP)` guard in the print function, so only rcclexp users will see this dump in console.

This is the changes of the PTD.

Test Plan:
Job with A2A hang due to receiver failing to post receive operations https://fburl.com/mlhub/95vg12r3
 {F1971449692}

Reviewed By: c-p-i-o

Differential Revision: D67036093

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143678
Approved by: https://github.com/c-p-i-o
2025-01-04 10:20:47 +00:00
Xiaodong Wang
0a94bb432e [ROCm] CK Flash Attention Backend (#143695)
Replace https://github.com/pytorch/pytorch/pull/138947 for re-import.

Replaces https://github.com/ROCm/pytorch/pull/1592

This PR contains the initial implementation of SDPA with composable_kernel backend. The CK path can be forced by simply calling torch.backends.cuda.preferred_rocm_fa_library("ck"). Similarly, you can force the incumbent aotriton implementation by passing in "aotriton" or "default". As you'd expect, not setting this option will result in aotriton to be used as the backend. In the case of CK, if pytorch deems flash attention usable, then it will use the CK path in all the same places aotriton would have been used. This PR makes no changes to the heuristics which select which attention scheme to use (i.e. flash attention vs memory efficient attention vs math etc etc). It only gets called when flash attention is both enabled (via USE_FLASH_ATTENTION) and is selected at runtime by the existing heuristics.

Files located in pytorch/aten/src/ATen/native/transformers/hip/flash_attn/ck/mha* have been pulled from https://github.com/Dao-AILab/flash-attention courtesy of @tridao's hard work who is the co-author

NOTE: In order to use this backend, the user MUST set USE_CK_FLASH_ATTENTION=1 in their environment when they build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143695
Approved by: https://github.com/malfet

Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2025-01-03 22:01:36 +00:00
Animesh Jain
732359c633 [dynamo][easy] Minor fixes in guards.cpp (#144130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144130
Approved by: https://github.com/williamwen42
ghstack dependencies: #144129
2025-01-03 18:22:56 +00:00
Benjamin Glass
b5b419d627 cpp_wrapper: Use runtime dispatched fallbacks for complex ops (#143223)
When calling a fallback op in cpp_wrapper mode, where any of the inputs are complex numbers, utilize the runtime dispatched fallback mode. This properly handles the Conjugate and Negative dispatch keys, if present, in exchange for a performance pessimization in complex arithmetic.

This PR additionally fixes some cascading failure modes exposed in our `aot_inductor` tests by this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143223
Approved by: https://github.com/desertfire
ghstack dependencies: #141371
2025-01-03 16:05:38 +00:00
Xu Han
e141cb9c34 export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2025-01-03 05:41:06 +00:00
Bin Bao
c5b75f8db1 [AOTI] Remove more AOTI_TORCH_EXPORT (#144081)
Summary: Similar to https://github.com/pytorch/pytorch/pull/142500, remove redundant AOTI_TORCH_EXPORT from several cpp files, to solve a windows build issue.

Differential Revision: D67762069

Pull Request resolved: https://github.com/pytorch/pytorch/pull/144081
Approved by: https://github.com/yushangdi
2025-01-03 02:17:38 +00:00
Michal Gallus
37e9da0687 [ROCm][Windows] Disable roctracer-related code (#143329)
Currently, the roctracer for Windows is not available. This PR disables any mentions of its usage for Windows, and creates dummy functions for Windows to keep compatibility with existing code, but which warn the user about the lack of Windows' availability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143329
Approved by: https://github.com/sraikund16
2025-01-03 01:51:01 +00:00
Driss Guessous
a8c98ce175 [cutlass-3] Update third-party/cutlass-3 from 3.4 to 3.5.1 (#143515)
# Summary:

This also makes updates to different repositories throughout FB code to roll any updates needed for this new release.

I was not able to get AsyncMM.cu to build (still trying) Yfiu suggested that I just skip it for now

Test Plan:
Have run various build commands to try and expose errors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143515
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-01-02 18:45:11 +00:00
Yu, Guangye
3848de55ed Add get_stream_from_external API for CUDA backend (#143799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143799
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119, #141123
2024-12-31 11:15:59 +00:00
Yu, Guangye
8f6c4d1732 Add get_stream_from_external API for XPU backend (#141123)
# Motivation
This PR aims to introduce `torch.xpu.ExternalStream` to be used to wrap SYCL queue created in other libraries to PyTorch.

# Additional Context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141123
Approved by: https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #142347, #141119
2024-12-31 11:15:52 +00:00
cyy
8df99b6a6e Remove unneeded std::make_optional (#143575)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143575
Approved by: https://github.com/Skylion007
2024-12-31 03:08:47 +00:00
Kai Londenberg
cb24013b5b Fix assertion failure in pytorch profiler (#143940)
Summary:
Attempt to fix the following exception which occurred when profiling a Pytorch model ( Meta-internal LLM ) that also involved a ThreadPoolExecutor in the background:
```
Exception Found: !stack.empty() INTERNAL ASSERT FAILED at "fbcode/caffe2/torch/csrc/autograd/profiler_python.cpp":987, please report a bug to PyTorch. Python replay stack is empty.
```
The root cause of this issue seems to be that a thread call stack can be empty, which is asserted to not be empty.

I fixed this with some minimal changes to profiler_python.cpp

Approach:
 * Ensuring that the stack in question is not empty before trying to pop from it.

Test Plan:
* Tested manually on a reproducible scenario where the assertion failure was otherwise triggered ( repro too large to include here ). The assertion failure disappears.
 * CI

Differential Revision: D67691558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143940
Approved by: https://github.com/Skylion007, https://github.com/sraikund16
2024-12-31 01:43:04 +00:00
cyy
af629a8146 Enable readability-redundant-declaration (#143982)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143982
Approved by: https://github.com/Skylion007
2024-12-31 00:20:10 +00:00
cyy
dca443835e Enable more readability-redundant checks (#143963)
They are helpful to simplifying code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143963
Approved by: https://github.com/albanD
2024-12-30 14:49:33 +00:00
PyTorch MergeBot
45a709d9ec Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit cbc4cf3043.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/malfet due to It broke the same test, but on ROCM this time, though it was classified as flaky for some reason, see d8c3900d80/1 ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2564378146))
2024-12-28 16:49:38 +00:00
PyTorch MergeBot
8cccc46e33 Revert "Add AOT inductor support for _scaled_mm for CPU (#141961)"
This reverts commit 3fabd10c40.

Reverted https://github.com/pytorch/pytorch/pull/141961 on behalf of https://github.com/malfet due to It broke the same test, but on ROCM this time, though it was classified as flaky for some reason, see d8c3900d80/1 ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2564378146))
2024-12-28 16:49:38 +00:00
Jiang, Yanbing
3fabd10c40 Add AOT inductor support for _scaled_mm for CPU (#141961)
This PR is to add AOT inductor support for _scaled_mm for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141961
Approved by: https://github.com/malfet
ghstack dependencies: #139975
2024-12-28 05:57:35 +00:00
Jiang, Yanbing
cbc4cf3043 Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are new added and included in `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. And this PR also updates the various UTs related to FP8 to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
2024-12-28 05:49:06 +00:00
cyy
379bbef23c Enable more C++ warnings (#143355)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143355
Approved by: https://github.com/albanD
2024-12-27 05:46:57 +00:00
PyTorch MergeBot
fca457b5db Revert "Add torch._scaled_mm for CPU (#139975)"
This reverts commit 3f80632c80.

Reverted https://github.com/pytorch/pytorch/pull/139975 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing some tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/139975#issuecomment-2563331259))
2024-12-27 05:25:17 +00:00
Animesh Jain
e296bab614 [dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722)
In hinsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call overridden keys method. Therefore, the C++ guard can use `PyDict_Next` directly to check the guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
2024-12-27 04:51:35 +00:00
Chirag Pandya
1cd70e7e23 [fr][c10d] log trace capture enabled or not in flight recorder (#143865)
Summary:
Refactor logging for flight recorder so we can log if the capture was
with or without stack trace capture enabled.
We introduce a new column ('trace_enabled') in the logger.

Test Plan:
Tested on local job and noted that correct output was produced.
Internal link: https://fburl.com/scuba/c10d_flight_recorder/ulhqnmhg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143865
Approved by: https://github.com/fduwjj
2024-12-27 03:07:55 +00:00
Jiang, Yanbing
3f80632c80 Add torch._scaled_mm for CPU (#139975)
This PR is to add `torch._scaled_mm` for CPU backend.

`_scaled_mm_out_cpu` and `_scaled_mm_cpu` are new added and included in `torch._scaled_mm` CPU dispatch. We also add `_scaled_mm_out_cpu_emulated` as a fallback function if the current platform cannot run FP8 matmul using oneDNN. And this PR also updates the various UTs related to FP8 to support CPU tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139975
Approved by: https://github.com/mingfeima, https://github.com/jgong5, https://github.com/malfet
ghstack dependencies: #139974
2024-12-26 22:22:42 +00:00
PyTorch MergeBot
26364428f5 Revert "[dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722)"
This reverts commit fe95cbe018.

Reverted https://github.com/pytorch/pytorch/pull/143722 on behalf of https://github.com/wdvr due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/143722#issuecomment-2563127017))
2024-12-26 22:04:36 +00:00
PyTorch MergeBot
9255ffc841 Revert "Enable more C++ warnings (#143355)"
This reverts commit daa3ffe0eb.

Reverted https://github.com/pytorch/pytorch/pull/143355 on behalf of https://github.com/malfet due to It fails internal build system as it kind of breaks separation between native and native/cpu ([comment](https://github.com/pytorch/pytorch/pull/143355#issuecomment-2562961546))
2024-12-26 17:13:10 +00:00
cyy
f42cff4e29 [17/N] Fix extra warnings brought by clang-tidy-17 (#143804)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143804
Approved by: https://github.com/Skylion007
2024-12-25 19:54:42 +00:00
Jason Ansel
b0c3f48a40 [inductor] Improve error message for assert_size_stride (#143765)
```
>>> torch._C._dynamo.guards.assert_size_stride(torch.randn(10), (10,), (2,))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError: expected size 10==10, stride 1==2 at dim=0
This error most often comes from an incorrect meta function for a custom op.
See https://pytorch.org/docs/stable/library.html#torch.library.opcheck
>>>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143765
Approved by: https://github.com/zou3519
2024-12-24 05:26:05 +00:00
cyy
1feae27ed6 [16/N] Fix extra warnings brought by clang-tidy-17 (#143714)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143714
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-12-24 03:29:38 +00:00
Animesh Jain
fe95cbe018 [dynamo] Remove DICT_SUBCLASS_GUARD_MANAGER and use dict.keys (#143722)
In hinsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call overridden keys method. Therefore, the C++ guard can use `PyDict_Next` directly to check the guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
2024-12-24 02:00:18 +00:00
Natalia Gimelshein
2ab698e708 allow profiling on all threads via experimentalConfig (#143659)
In some situations we want to profile calls coming from all threads (similar to on-demand), not just the thread that started profiling and the spawned threads that would inherit KinetoThreadLocal state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143659
Approved by: https://github.com/sraikund16
2024-12-23 20:41:27 +00:00
Aaron Orenstein
06b4b96b34 dynamo tracing perf: no re in arg_ref: 33.9 -> 33.7 (#143069)
See #143056 for overall docs.

This PR: Avoid use of python re and move valid varname check in
`GuardBuilder.arg_ref()` into C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143069
Approved by: https://github.com/jansel
2024-12-23 05:32:09 +00:00
Aaron Orenstein
9bf4b1c2e9 dynamo tracing perf: c++ strip_function_call: 49.12 -> 47.77 (#143063)
See #143056 for overall docs.

This PR: Convert `strip_function_call()` into C++

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143063
Approved by: https://github.com/jansel
ghstack dependencies: #143057, #143062
2024-12-22 06:38:46 +00:00
cyy
daa3ffe0eb Enable more C++ warnings (#143355)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143355
Approved by: https://github.com/albanD
2024-12-21 09:19:02 +00:00
PyTorch MergeBot
e15442a9b2 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit 6733045a4a.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but my first attempt to fix internal build does not fix all the cases, so let us try again ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2558043056))
2024-12-21 08:06:19 +00:00
Tristan Rice
6e58c37542 c10d: no call_guard in init (#143598)
`py::call_guard<py::gil_scoped_release>` is not safe when using multiple threads. This instead moves it into the init function which is safe.

For more details see #143593

https://github.com/pybind/pybind11/issues/5473

Test plan:

```
python setup.py develop
```

CI

```py
import time
from concurrent.futures import ThreadPoolExecutor
from torch import distributed as dist

def run():
    store = dist.TCPStore(
        host_name="localhost",
        port=0,
        is_master=True,
        wait_for_workers=False,
    )

    # this sleep is required to trigger the crash
    time.sleep(0.1)
    del store

futures = []
with ThreadPoolExecutor(
    max_workers=100,
) as executor:
    for i in range(100000):
        print(i)
        futures.append(executor.submit(run))
        if len(futures) > 100:
            futures.pop(0).result()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143598
Approved by: https://github.com/c-p-i-o
2024-12-20 22:23:36 +00:00
Chirag Pandya
485497e727 [c10d][fr] flight recorder improvements (#143446)
Summary:
1. Flight recorder dumps are now automatically dumped by default upon
   timeout or exception. Users don't need to opt-in.
2. Change default dump location to running user's home directory
   `.cache` folder.

Test Plan:
1. Tested locally by running the crash program from flight recorder
   tutorial page.
   https://pytorch.org/tutorials/prototype/flight_recorder_tutorial.html#an-end-to-end-example
2. Noted that flight recorder files were correctly created.
❯ pwd
/home/cpio/.cache/fr_trace
❯ ls
nccl_trace_rank_0  nccl_trace_rank_1

Differential Revision: [D67363720](https://our.internmc.facebook.com/intern/diff/D67363720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143446
Approved by: https://github.com/d4l3k
2024-12-20 20:41:30 +00:00
Nikhil Gupta
94737e8a2a [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-20 19:32:03 +00:00
William Wen
7ab880bc5e fix typo in autocast header (#143625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143625
Approved by: https://github.com/mlazos
ghstack dependencies: #143592
2024-12-20 16:17:15 +00:00
Guilherme Leobas
673cc88fd6 Add support for contextmanager in Dynamo (#136033)
Fixes #130559

* Intro

This PR adds support for `@contextmanager` in Dynamo. We chose to limit the
scope of this work to only `@contextmanager` and plan to handle generators fully
in #141055 (still in draft).

* Motivation

Dynamo lacks support for generator functions. When it encounters one, it traces
it as if it were a regular function. This is problematic because it can lead to
incorrect behavior. To illustrate, consider the test case below:

```python
import torch
import contextlib

@contextlib.contextmanager
def set_default_dtype(dtype):
    old_dtype = torch.get_default_dtype()
    try:
        torch.set_default_dtype(dtype)
        yield
    finally:
        torch.set_default_dtype(old_dtype)

@torch.compile(backend="eager", fullgraph=True)
def fn():
    with set_default_dtype(torch.float64):
        x = torch.tensor([3.0, 3.0 + 5.0j])
    return x
```

Before this work, Dynamo would not stop at the `yield`, and the graph produced
would contain both calls to `set_default_dtype` executed one after the other.
This is incorrect because the context manager should execute code before and
after the `yield`.

* List of changes

`YIELD_VALUE` now raises an exception (`YieldValueOp`) to signal that control
flow must be suspended and returned to the caller. Additionally, `RETURN_VALUE`
behaves differently in a generator function. Unlike regular functions, where
`RETURN_VALUE` indicates the final result, in generators it signifies that the
generator is exhausted and implicitly raises `StopIteration`.

A new `VariableTracker` named `FunctionDecoratedByContextlibContextManagerVariable`
was introduced to handle `@contextmanager`. This variable tracker acts not just
as a wrapper for the original function but also maintains an internal `tx`
(InstructionTranslator) object to suspend and return control flow to the parent
tracer when a `yield` is encountered.

* Corner cases

Returning a context manager from a compiled function is not supported. This
would require PyTorch to synchronize the generator state between Dynamo and the
interpreter. Any attempt to return it will result in an `IncorrectUsage`
exception.

Graph breaks require special handling as well. In the event of a graph break,
the frame associated with the context manager is skipped, and the context
manager runs in eager mode.

* This PR is breaking my code

There is a configuration flag (`enable_trace_contextlib`) that can be set to
`False` to disable tracing context managers. If this still causes crashes,
please revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136033
Approved by: https://github.com/zou3519
2024-12-20 12:02:20 +00:00
Xu Han
6733045a4a export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Reland:
1. Declare export on Windows explicitly.
2. Support cpu, cuda and xpu devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-20 11:42:09 +00:00
William Wen
1c2593f035 [dynamo] guard global autocast state (#143592)
Fixes https://github.com/pytorch/pytorch/issues/112260.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143592
Approved by: https://github.com/jansel
2024-12-20 03:30:54 +00:00
PyTorch MergeBot
8136daff5a Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit 4b82251011.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks lots of internal build ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2555953189))
2024-12-19 23:33:17 +00:00
Nikhil Gupta
4b82251011 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-19 18:51:26 +00:00
Animesh Jain
465f282a24 [reland][dynamo][guards] Consider tensors as immutable for dict tag matches (#141085)
Reland - https://github.com/pytorch/pytorch/pull/139560

As mentioned in https://github.com/pytorch/pytorch/pull/130341, using `static py::object` can lead to segfaults. I suspect this is the reason for the import system error seen internally (https://www.internalfb.com/sevmanager/view/469592). In this PR, I am removing the `static` part. This is fine and also the right thing to do because this will catch if user changes the flag in the same process for compiling two different functions.

Unfortunately, there is no easy way to trigger this segfault, so I can't write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141085
Approved by: https://github.com/jansel

Co-authored-by: William Wen <williamwen@meta.com>
2024-12-19 15:16:10 +00:00
Joe Wang
451c233936 leaking c++ singleton specifically (#143509)
Summary:
fix forward for S477887

leaking c++ singleton specifically

when c++ shutdown, it tries to destruct the singleton and acquire GIL, at this moment python runtime exists already, causing undefined behavior.
Leaking here specifically so that we won't try to destroy singleton at the shutdown phase

Test Plan: n/a

Differential Revision: D67400633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143509
Approved by: https://github.com/c-p-i-o
2024-12-19 09:27:07 +00:00
Sean Xiao
e4301aeaa5 [ODML] Make the ML feature provider thread safe (#143418)
Summary:
This PR is generated from a meta internal Diff, aiming to resolve a crash from a race condition on the dictionary.

Test Plan:

Build and run

Print out the count/name/value of the dictionary and see if the values are get/set/removed correctly.

Observe the print statement on app start within IG

@diff-train-skip-merge

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143418
Approved by: https://github.com/shoumikhin
2024-12-19 04:47:56 +00:00
PyTorch MergeBot
14fe1f7190 Revert "[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)"
This reverts commit d3ff2d42c2.

Reverted https://github.com/pytorch/pytorch/pull/134124 on behalf of https://github.com/malfet due to This broke S390 builds, includes cpuinfo unconditionally ([comment](https://github.com/pytorch/pytorch/pull/134124#issuecomment-2552560208))
2024-12-19 01:05:11 +00:00
Nikhil Gupta
d3ff2d42c2 [ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124)
Description:
1. Quantize Linear Layer Weights to 4-bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.

2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.

3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (the same as the Linear layer’s corresponding parameters).

4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights,  groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights , groupsize, and the in_features and out_features.

API Usage: https://github.com/pytorch/pytorch/issues/143289

Model Perf :
7B Transformer model:
Prefill : 340 t/s
Decode  : 40  t/s
2B Transformer model
Prefill : 747 t/s
Decode  : 80  t/s

Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s

OK

python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s

OK

python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s

Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-12-18 22:30:07 +00:00
James
d4ed5941db Fix floating point literals in IRPrinter (#142119)
Fixes #114035
This is a recreation of #140002 with approval from its author. Original description:
>when v larger than 1e16, the format will be error. example: v is 1.2e17, the output is 1.2e17.f, it have two point '.'

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142119
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-12-18 21:59:48 +00:00
FFFrog
f47aac6bc2 Make Context to be Device-agnostic Step by Step (3/N) (#137578)
Detailed Descriptions:
- Using unified Device-agnostic API to create new generator for accelerator.
- Add deprecated info for GeneratorForPrivateuseone

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137578
Approved by: https://github.com/cyyever, https://github.com/ezyang
2024-12-18 15:12:19 +00:00
Zizeng Meng
eb67dd3e2d [3/N][Memory Profiling] Add memory profiling function for MTIA hooks (#142149)
Design Doc: https://fburl.com/gdoc/47zpuweb
Prototyping:  D66469341

In this diff, we implement two new mtia hooks to start/stop profiler and export the memory snapshot.

In next diff, we will integrate the mtia backend with profiler python api

Differential Revision: [D66823583](https://our.internmc.facebook.com/intern/diff/D66823583/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/142149
Approved by: https://github.com/nautsimon
2024-12-18 11:58:23 +00:00
Sheng Fu
2ea4b56ec8 Record min/max of integral tensor in ET (#143088)
Summary:
In et-replay, random data is used to run the operators. However, it does not work well for the op that uses index to access tensor. For example, embedding ops, which use the indices to look up the embedding table. If random data is used for these index ops, et-replay usually runs into invalid memory access issue.

To fix it, ET provides an environment variable "ENABLE_PYTORCH_EXECUTION_TRACE_INTEGRAL_TENSOR_RANGE", if it is set, ET will capture the min/max value of the flattened integral tensor. Then in et_replay, the min/max is used to generate the random tensor within that range. It fixed invalid memory access issue.

Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_range_cuda

Differential Revision: D66666931

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143088
Approved by: https://github.com/sanrise
2024-12-18 08:20:35 +00:00
Shangdi Yu
c17a07ade3 Add float8 support in serde schema (#143343)
Summary:
Fix https://github.com/pytorch/pytorch/issues/141316

Bump up schema minor version.

as title, add float8 support in serde schema

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r  test_serialize_float8
```

Differential Revision: D67307670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143343
Approved by: https://github.com/yiming0416
2024-12-18 05:07:21 +00:00