Commit Graph

462 Commits

Author SHA1 Message Date
Aaron Gokaslan
d3c2123ea6 [BE][CUDA][Bugfix]: Enable extended MMA shapes in CUTLASS. (#133686)
* This fixes a major CMake/Bazel configuration bug where we were leaving CUTLASS performance on the table, especially with FlashAttention. This now enables using MMA instructions on SM90+, which should close the gap between SDPA and the external FA2. Note these operations only affect H100 and newer GPUs. Thankfully, recent CUTLASS updates seem to have made this a no-op on the CUTLASS side, but it is still better to set the CMake variable properly.
* Also enables the additional new shape kernels added in the recent CUTLASS 3.5.1+ update. This was the original motivation of the PR, before I realized the basic MMA kernels were accidentally disabled since we didn't go through the submodule's CMake/Bazel configuration.
* Adds a bit to compile time and code size, but it's well worth it considering it significantly speeds up our internal flash attention on H100s.
* These kernels and settings will also be needed for Flash Attention 3 whenever we add it.

Fixes #133695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133686
Approved by: https://github.com/ezyang
2024-09-28 21:11:15 +00:00
James Wu
96104db132 [easy] fix typo in debug logs for fx graph cache (#136889)
Summary: Accidentally messed up the debug logging here, fixing typo (scuba + tlparse logging is unaffected)

Test Plan: existing tests

Differential Revision: D63555766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136889
Approved by: https://github.com/oulgen
2024-09-28 03:56:09 +00:00
Oguz Ulgen
9abdc62065 Allow fx graph caching higher order operators (opt-in) (#135877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877
Approved by: https://github.com/zou3519
2024-09-24 17:23:09 +00:00
Aaron Orenstein
9fc721d22b Add cache logs + other minor caching cleanup (#136456)
Summary:
- Added TORCH_LOGS=cache to dump cache stats on exit - supported by RemoteCache (see the usage sketch after this list).
- Split REMOTE_CACHE_VERSION - it was used for both the JKs fx_graph_memcache_version and autotune_memcache_version, but they really should be separate (just in case we need to change one but not the other)
- Prepare `_ManifoldCache` for use with other subpath keys
- Move create_cache to be more public and use it in codecache
- Add _InductorMetaTy alias (still just a dict)
- Cleaned up some common cached_autotune calls in triton_heuristics
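A minimal usage sketch for the new cache log, assuming the standard TORCH_LOGS environment-variable mechanism; the workload here is illustrative:

```python
# Set TORCH_LOGS before importing torch so the logging setup sees it.
import os
os.environ["TORCH_LOGS"] = "cache"

import torch

@torch.compile
def f(x):
    return x.sin() + x.cos()

f(torch.randn(8))
# Per this PR, cache stats are dumped when the process exits.
```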

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D62648249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136456
Approved by: https://github.com/oulgen
2024-09-24 14:00:23 +00:00
PyTorch MergeBot
e9bfbf78d5 Revert "Allow fx graph caching higher order operators (opt-in) (#135877)"
This reverts commit 66d5eb64e0.

Reverted https://github.com/pytorch/pytorch/pull/135877 on behalf of https://github.com/jeanschmidt due to seems to have introduced regressions on rocm signals ([comment](https://github.com/pytorch/pytorch/pull/135877#issuecomment-2367616653))
2024-09-23 09:04:24 +00:00
Oguz Ulgen
66d5eb64e0 Allow fx graph caching higher order operators (opt-in) (#135877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877
Approved by: https://github.com/zou3519
2024-09-23 04:33:27 +00:00
James Wu
7537f74277 Refactor FxGraphCache.load into separate functions, so that AOTAutogradCache may access it correctly later (#135491)
Summary:
We refactor FxGraphCache.load into three phases:
- prepare_key, which checks that an inductor input is cacheable and bypasses otherwise
- load_with_key, which tries to lookup the key in the cache
- post_compile, where we do some logging and run post-compile steps

Splitting it along these lines will allow AOTAutogradCache to use load_with_key and still get access to all of the observability + remote cache logic when accessing FxGraphCache, without needing to pass key components, etc.
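A schematic sketch of the three-phase split, with phase names taken from the commit text; the bodies and types here are stand-ins, not the real implementation:

```python
class FxGraphCacheSketch:
    def __init__(self):
        self._store = {}  # stand-in for the local/remote cache backends

    def prepare_key(self, graph_repr: str):
        # Phase 1: check that the inductor input is cacheable; bypass
        # (return None) otherwise.
        if "uncacheable" in graph_repr:
            return None
        return hash(graph_repr)

    def load_with_key(self, key):
        # Phase 2: plain lookup by key. AOTAutogradCache can call this
        # directly and reuse the observability + remote cache logic
        # without re-deriving key components.
        return self._store.get(key)

    def post_compile(self, compiled):
        # Phase 3: shared logging and post-compile steps.
        print("cache", "hit" if compiled is not None else "miss")
        return compiled

cache = FxGraphCacheSketch()
key = cache.prepare_key("graph<add,mul>")
result = cache.post_compile(cache.load_with_key(key))
```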

Differential Revision: D62314862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135491
Approved by: https://github.com/oulgen
2024-09-16 19:48:08 +00:00
Bin Bao
b4c84c3167 [AOTI] Fix a fallback op returning None issue (#135997)
Summary: Fixes https://github.com/pytorch/pytorch/issues/135781. In some cases, a fallback can return None in the place of a tensor.

Differential Revision: [D62659039](https://our.internmc.facebook.com/intern/diff/D62659039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135997
Approved by: https://github.com/chenyang78
2024-09-14 18:12:06 +00:00
Oguz Ulgen
3352c9ac94 Add higher order operator name to the cache bypass exception (#135876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135876
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2024-09-14 07:05:29 +00:00
James Wu
ad2f0e9f81 Add remote cache time saved to compilation metrics (#135490)
Summary:
Record remote cache time saved via frame_phase_timing

We add to the "phase" when remote cache hits and saves us time, so that we have a 1:1 correspondence between a frame and time saved.

Test Plan:
Internally run the benchmark and see that it's populated in the sandbox table after the previous diff lands and the logger config is actualized.

Show that column exists in table:

https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/fp2te0ff

Note that an earlier version of D62105258 had the column as a string, so the staging table is a bit messed up. But you can see the most recent samples have the column populated as a float.

Reviewed By: aorenste

Differential Revision: D62106921

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135490
Approved by: https://github.com/aorenste
2024-09-13 16:35:51 +00:00
angelayi
cd9ee49a69 [aoti] Add cpp loader (#135374)
* Added a cpp loader, AOTIModelPackageLoader, which can load the .pt2, build the .so, and create a runner. The python-facing API is that users can directly call the `run` function, whereas in cpp users can directly access the `runner_` if they are more familiar with that. I couldn't figure out how to bind the `get_runner()` function to python...
* Added a new config, `aot_inductor.package_cpp_only` which will **not** package the so. This means that whenever the package is loaded, we will need to build the so. This is turned off by default so that new environments do not need to rebuild their so. The `package_cpp_only` is a feature which torchchat intends to use to provide flexibility to users.
* Added a new config, `aot_inductor.metadata` which stores user-provided metadata, serialized to the pt2 as a json file. It also stores the device used when exporting, "cuda" or "cpu", so that during load time, we can use that data to determine which AOTIModelContainerRunner to use. The metadata can be accessed through `loader.get_metadata()`. TODO is to move this metadata to the toplevel `package_aoti` function so that we can remove the metadata as a config.
* Separated out `package_aoti` as a standalone function, instead of it automatically being called in inductor. This is to prepare for the case where users will compile multiple models and want to bundle them in one package. The specific use case is in torchchat, where we want to package the separately-exported encoder and decoder layers. An example of how to use this is in `test_multiple_methods`.
* `load_package` will load a singular model, given the model name.
* The loader doesn't support Windows for now; I think I need to add some more special-casing to make the build commands work on Windows? (See the usage sketch below.)
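A hedged usage sketch of the packaging flow described above; the module path and call signatures follow the commit text but are assumptions, not a verified API:

```python
# Assumed module path and signatures, based on the commit text.
from torch._inductor.package import package_aoti, load_package

# encoder_files / decoder_files stand in for the AOTI output files produced
# by separately exporting two models (the torchchat encoder/decoder case).
encoder_files: list[str] = []  # placeholder
decoder_files: list[str] = []  # placeholder

# Standalone packaging: bundle multiple compiled models into one .pt2.
# package_aoti("model.pt2", {"encoder": encoder_files, "decoder": decoder_files})

# Load a single model by name; the cpp AOTIModelPackageLoader builds the .so
# if only cpp was packaged (aot_inductor.package_cpp_only) and exposes `run`.
# decoder = load_package("model.pt2", "decoder")
# out = decoder.run(example_inputs)   # assumed call convention
# meta = decoder.get_metadata()       # user metadata + export device
```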

Differential Revision: [D62329906](https://our.internmc.facebook.com/intern/diff/D62329906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135374
Approved by: https://github.com/desertfire, https://github.com/malfet
2024-09-11 03:00:01 +00:00
xinan.lin
67735d1ee8 [Inductor] Generalize is_cuda to specific device_type to make cpp_wrapper mode be extensible (#134693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134693
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/jansel
2024-09-10 10:11:13 +00:00
Oguz Ulgen
13ba0a2e5c Run bypassed graph compile outside the except block to avoid chaining of exceptions (#135175)
Fixes #135172
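For context, a minimal illustration of the exception-chaining problem this avoids; the cache and compile functions are stand-ins:

```python
def load_from_cache(gm):
    raise KeyError("bypass")  # stand-in for a BypassFxGraphCache-style error

def real_compile(gm):
    return f"compiled({gm})"

def compile_fx(gm):
    try:
        return load_from_cache(gm)
    except KeyError:
        # Don't compile here: a failure inside the except block would be
        # chained to the bypass exception ("During handling of the above
        # exception, another exception occurred").
        pass
    # Outside the except block, compile errors are reported on their own.
    return real_compile(gm)

print(compile_fx("gm"))  # compiled(gm)
```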

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135175
Approved by: https://github.com/masnesral, https://github.com/ezyang
2024-09-06 19:03:57 +00:00
leslie-fang-intel
07689a38bf [Inductor] Fix AOT weight alignment issue on CPU (#135205)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135027. On CPU, the `consts_size` used to generate `_binary_constants_bin_start` is not padded to `ALIGN_BYTES`, while `serialized_weights` is, causing a failure in the 16K alignment check.
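The fix boils down to rounding the constants size up to the alignment boundary, as in this sketch; the ALIGN_BYTES value here is illustrative:

```python
ALIGN_BYTES = 64  # illustrative; the real value comes from the AOTI codegen

def pad_to_alignment(nbytes: int, align: int = ALIGN_BYTES) -> int:
    # Round up so consts_size matches the padded serialized_weights.
    return (nbytes + align - 1) // align * align

assert pad_to_alignment(100) == 128
assert pad_to_alignment(128) == 128
```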

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135205
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-09-06 03:06:51 +00:00
Oguz Ulgen
2dadc2c8fc Log fx graph cache bypass reasons (#134792)
Summary: Let's track when we bypass and why

Test Plan: unit tests

Differential Revision: D61994739

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134792
Approved by: https://github.com/jamesjwu
2024-09-01 19:02:09 +00:00
Aaron Orenstein
7239b8a4f1 Clean up RemoteCache classes (#134032)
Summary:
The existing RemoteCacheBackend classes were a bit haphazard - some of them accepted bytes only, some accepted objects, some returned different types of objects than were passed in.

Update them to be more consistent:

1. RemoteCacheBackend is an implementation of a backend: Redis, Memcache, Manifold, LocalFile

2. RemoteCacheSerde is an implementation of a serde protocol - to turn structured objects (dict, list, etc) into bytes: RemoteCacheJsonSerde (json encoding), RemoteCachePassthroughSerde (strictly bytes only)

3. RemoteCache is the cache implementation itself, combining a RemoteCacheBackend with a RemoteCacheSerde to provide structured caching.

Other than simply reorganizing the existing cache code this also fixes the Redis autotune caching for OSS.
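A structural sketch of the three layers with minimal interfaces; the class names follow the commit, the bodies are stand-ins:

```python
import json

class RemoteCacheBackend:
    """Layer 1: raw bytes storage (Redis, Memcache, Manifold, LocalFile)."""
    def __init__(self):
        self._data = {}
    def get(self, key: str):
        return self._data.get(key)
    def put(self, key: str, value: bytes):
        self._data[key] = value

class RemoteCacheJsonSerde:
    """Layer 2: turn structured objects (dict, list, ...) into bytes."""
    def encode(self, obj) -> bytes:
        return json.dumps(obj).encode()
    def decode(self, data: bytes):
        return json.loads(data)

class RemoteCache:
    """Layer 3: structured caching = a backend combined with a serde."""
    def __init__(self, backend, serde):
        self.backend, self.serde = backend, serde
    def get(self, key: str):
        data = self.backend.get(key)
        return None if data is None else self.serde.decode(data)
    def put(self, key: str, obj):
        self.backend.put(key, self.serde.encode(obj))

cache = RemoteCache(RemoteCacheBackend(), RemoteCacheJsonSerde())
cache.put("autotune/k1", {"best_config": [64, 2]})
assert cache.get("autotune/k1") == {"best_config": [64, 2]}
```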

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D61178859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134032
Approved by: https://github.com/oulgen, https://github.com/bhack
2024-08-31 20:18:59 +00:00
Nikita Shulga
af82dc816a Fix lint failures (#134488)
Introduced by https://github.com/pytorch/pytorch/pull/131000

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134488
Approved by: https://github.com/Skylion007, https://github.com/msaroufim, https://github.com/albanD, https://github.com/atalman
2024-08-26 20:13:21 +00:00
eqy
3541e450af Support larger page sizes with use_mmap_weights (#131000)
Fixes e.g., `test_large_mmaped_weights_non_abi_compatible_cuda` on machines with 64K page size

CC @malfet @tinglvv @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131000
Approved by: https://github.com/malfet
2024-08-26 18:35:55 +00:00
James Wu
3c5485fb7f [Retry] Log chromium events to scuba (#134118)
Summary:
This diff implements a bunch of views for internal scuba viewing.

TODOS that I might punt to another diff:
- Saving cache stats via counter is definitely sus here, but there's not really a good way to track "fx graph cache hit for this compile phase" right now. Will think about this more.
- We should definitely log frame id, compile id, etc
- We should definitely be logging configs. That way, we can A/B test based on whether a config is turned on.
- idk what I'm doing with compile_uuid yet, but it's useful when you want to look at samples for a single run. I think if we had mast job info this field is not needed, but it's nice to be able to drill down to a single run and get its chrome trace view or icicle view, so idk

Test Plan:
All of the above views are run with nanogpt benchmark:

```
buck run mode/opt caffe2/benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --performance
```

Differential Revision: D61603243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134118
Approved by: https://github.com/oulgen
2024-08-22 14:59:45 +00:00
Xu Han
5fb8754434 [inductor] write cpp code with encoding utf-8 (#134027)
Windows differs from Linux: each Windows version, with its different language packs, can have a different code page.
Inductor on Windows writes the generated cpp code using the local code page, which can cause failures when characters cannot be decoded.

For this situation, Microsoft suggests using Unicode instead of a specific code page. Ref: https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers

Changes:
1. Use `utf-8` as the encoding for cpp code.
2. This only changes the encoding for cpp code, not for binary data; binary mode is kept for the AoT binary context.

It works on https://github.com/pytorch/pytorch/issues/122094#issuecomment-2299592942.
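A minimal sketch of the change, using only the standard library:

```python
from pathlib import Path

cpp_code = "// generated by inductor\nint main() { return 0; }\n"

# Text: write generated cpp with an explicit utf-8 encoding rather than the
# platform default (a locale-dependent code page on Windows).
Path("kernel.cpp").write_text(cpp_code, encoding="utf-8")

# Binary (the AoT binary context) keeps binary mode, with no encoding.
Path("consts.bin").write_bytes(b"\x00\x01\x02")
```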

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134027
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/jansel
2024-08-22 11:54:32 +00:00
Xu Han
fbf3fc2a30 [inductor] Use int64_t as index type for all platforms 4 (#133892)
This is a parallel PR to https://github.com/pytorch/pytorch/pull/133819, with follow-up changes for @jansel's comments.
1. For `torch/_inductor/codegen/cpp_wrapper_cpu.py`, revert to the original code that appends LL on MacOS and Windows: bdc14ad89a
2. For `torch/_inductor/codegen/cpp_utils.py`, append LL on MacOS and Windows for large constants, and fix its UTs: 3a56b76ce0

------------------------------
Another solution for https://github.com/pytorch/pytorch/pull/133615: use `int64_t` as the index type for all platforms.

### Development notes:
The mentioned PR (https://github.com/pytorch/pytorch/pull/133615) fixes the index type not matching the parse_arg argument types. As reviewed with @jansel, Jason thinks we need to unify `INDEX_TYPE` across all platforms.
The current code is cumbersome:
```python
INDEX_TYPE = "int64_t" if _IS_WINDOWS else "long"
```

So I made some attempts to unify `INDEX_TYPE` as either `long` or `int64_t`.
For use `long` as index type: https://github.com/pytorch/pytorch/pull/133768
For use `int64_t` as index type: https://github.com/pytorch/pytorch/pull/133782

After that, we discussed which type to select as the final solution.
![image](https://github.com/user-attachments/assets/b23fa577-2d40-4bd6-b934-fb7994fe0bb0)

The `long` type has different definitions and sizes across OSs and compilers, so @jansel decided that we need to select `int64_t` for all platforms. I therefore continued my work based on https://github.com/pytorch/pytorch/pull/133782.

https://github.com/pytorch/pytorch/pull/133782 still had two issues:
1. std::min/std::max could not match function instances by argument types. This was fixed and validated in PR: https://github.com/pytorch/pytorch/pull/133812
2. A CUDA TestMemoryPlanning::test_cpp_wrapper failure caused by a wrong index type, which is fixed in this PR.

So this PR contains the final solution.

### Changes:
**1. Use `int64_t` type as index type for all OSs: `Windows`, `Linux` and `MacOS`.**
**2. Use static_cast<int64_t>(`constant`) to convert constants passed to `div_floor_integer`, whose argument type is `int64_t`.**
**3. Update the `parse_arg` function signature to `int64_t`, following the index type.**
**4. Append a double L (`LL`) to constants on Windows and MacOS, because their `int64_t` is `long long` (see the sketch after this list).**
**5. Fix `std::min`/`std::max` type mismatches by static_cast to `INDEX_TYPE`.**
**6. Fix UTs, including the CUDA `TestMemoryPlanning::test_cpp_wrapper` and `test_indexing.py`.**
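A hedged Python sketch of the rules in items 2 and 4; the helper names and the Linux branch are illustrative, not the actual codegen:

```python
import sys

INDEX_TYPE = "int64_t"  # unified across Windows, Linux, and MacOS

def emit_index_constant(value: int) -> str:
    # On Windows and MacOS, int64_t is `long long`, so constants get an LL
    # suffix there; the Linux branch below is an assumption for illustration.
    if sys.platform in ("win32", "darwin"):
        return f"{value}LL"
    return str(value)

def emit_div_floor(lhs: str, rhs: int) -> str:
    # Item 2: static_cast the constant so div_floor_integer's int64_t
    # argument type matches.
    return f"div_floor_integer({lhs}, static_cast<int64_t>({rhs}))"

print(emit_index_constant(9223372036854775807))
print(emit_div_floor("i0", 128))
```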

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133892
Approved by: https://github.com/jansel
2024-08-20 16:54:12 +00:00
Oguz Ulgen
65b3e42074 Warn on fx graph cache bypass and log it to tlparse (#133826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133826
Approved by: https://github.com/aorenste
2024-08-19 23:39:55 +00:00
Aaron Orenstein
68fcd54226 Lower cache mocking to test more pytorch code (#133579)
Summary: Previously we were mocking out FbRemoteFxGraphCacheBackend, which meant we were missing testing a whole bunch of the cache code. Mock at a lower level (CacheClient, LocalAutotuneCacheBackend, ManifoldClient, Redis) so we cover a larger amount of the caching code.

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D60937966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133579
Approved by: https://github.com/oulgen
2024-08-19 16:32:36 +00:00
Oguz Ulgen
12b8e29203 Add a fudge factor to ephemeral NCCL timeout increase (#133722)
Differential Revision: [D61422431](https://our.internmc.facebook.com/intern/diff/D61422431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133722
Approved by: https://github.com/c00w, https://github.com/aorenste
ghstack dependencies: #133504
2024-08-17 03:08:40 +00:00
Oguz Ulgen
455f6bda56 Add cache timings info to tlparse (#133504)
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpLR1T85/rank_1/0_0_0/fx_graph_cache_hash_11.json

Differential Revision: [D61422432](https://our.internmc.facebook.com/intern/diff/D61422432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133504
Approved by: https://github.com/jamesjwu
2024-08-17 01:37:53 +00:00
Oguz Ulgen
0063e56949 Make FX Graph Cache work with distributed training (#133374)
During distributed training, if all ranks except one hit the cache, the rank that did not hit the cache will cause a NCCL timeout, since the rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase the timeout for the ranks that hit the cache by the amount of time the cache would save.
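The idea, as a hedged sketch; `extend_timeout` stands in for the new PTD API, whose exact name the commit text does not give:

```python
import datetime

class ProcessGroupSketch:
    """Stand-in for a process group; extend_timeout is a hypothetical API."""
    def __init__(self):
        self.timeout = datetime.timedelta(minutes=10)
    def extend_timeout(self, extra: datetime.timedelta):
        self.timeout += extra

def on_cache_hit(pg: ProcessGroupSketch, time_saved_s: float):
    # Ranks that hit the cache enter the collective early and start the
    # NCCL timer; extending their timeout by the compile time the cache
    # saved gives the cache-miss rank time to finish compiling.
    pg.extend_timeout(datetime.timedelta(seconds=time_saved_s))

pg = ProcessGroupSketch()
on_cache_hit(pg, time_saved_s=95.0)
print(pg.timeout)  # 0:11:35
```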

Differential Revision: [D61363722](https://our.internmc.facebook.com/intern/diff/D61363722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374
Approved by: https://github.com/ezyang
2024-08-16 18:51:14 +00:00
Bill Yoshimi
4ee65c7e4e Add message text to BypassFxGraphCache exceptions. (#133505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133505
Approved by: https://github.com/oulgen
2024-08-16 18:02:59 +00:00
Xuehai Pan
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statements. This is semantically safe because the bytecode for the Python code does not change.

Note that if a function has a docstring, the otherwise-empty function does not need a `pass` statement as a placeholder.
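The rule in miniature:

```python
# Flagged by PIE790: `pass` is redundant when a docstring already forms the
# function body.
def handler_before():
    """Handle the event."""
    pass

# Equivalent after the cleanup; the compiled bytecode is unchanged.
def handler_after():
    """Handle the event."""
```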

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
PyTorch MergeBot
07adae3dac Revert "Make FX Graph Cache work with distributed training (#133374)"
This reverts commit dcdb25453e.

Reverted https://github.com/pytorch/pytorch/pull/133374 on behalf of https://github.com/albanD due to Broke trunk ([comment](https://github.com/pytorch/pytorch/pull/133374#issuecomment-2291289260))
2024-08-15 13:43:16 +00:00
PyTorch MergeBot
32d890745d Revert "Add cache timings info to tlparse (#133504)"
This reverts commit 7eb31e5023.

Reverted https://github.com/pytorch/pytorch/pull/133504 on behalf of https://github.com/albanD due to Broke trunk ([comment](https://github.com/pytorch/pytorch/pull/133374#issuecomment-2291289260))
2024-08-15 13:43:16 +00:00
Oguz Ulgen
7eb31e5023 Add cache timings info to tlparse (#133504)
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpLR1T85/rank_1/0_0_0/fx_graph_cache_hash_11.json

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133504
Approved by: https://github.com/jamesjwu
ghstack dependencies: #133362, #133363, #133374
2024-08-15 05:53:00 +00:00
Oguz Ulgen
dcdb25453e Make FX Graph Cache work with distributed training (#133374)
During distributed training, if all ranks except one hit the cache, the rank that did not hit the cache will cause a NCCL timeout, since the rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase the timeout for the ranks that hit the cache by the amount of time the cache would save.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374
Approved by: https://github.com/ezyang
ghstack dependencies: #133362, #133363
2024-08-14 22:58:48 +00:00
Xu Han
36c4ed8e49 [inductor] add FreeLibrary to DLLWrapper for Windows. (#133184)
In the previous PR, https://github.com/pytorch/pytorch/pull/132630, we found the `DLLWrapper` class doesn't have a `_dlclose` implementation for Windows.

I wrote a small test project to figure out how to make it work on Windows: https://github.com/xuhancn/ctypes_all_lifecycle/blob/main/pysrc/module_manage.py#L30-L61
Test result: https://github.com/xuhancn/ctypes_all_lifecycle/tree/main?tab=readme-ov-file#ctypes_cyclepy

So I have ported the Windows FreeLibrary implementation to PyTorch's DLLWrapper in this PR.
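A minimal sketch of the Windows side, using only documented ctypes/kernel32 calls; the integration into DLLWrapper (and the private `_handle` access) is simplified here:

```python
import ctypes
import sys

def free_dll(lib: ctypes.CDLL) -> None:
    # Windows has no dlclose; release the module with kernel32.FreeLibrary.
    if sys.platform == "win32":
        kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
        kernel32.FreeLibrary.argtypes = [ctypes.c_void_p]
        kernel32.FreeLibrary.restype = ctypes.c_bool
        # CDLL keeps the OS module handle in the private `_handle` attribute.
        kernel32.FreeLibrary(ctypes.c_void_p(lib._handle))
```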

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133184
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-12 19:55:48 +00:00
Xu Han
4a3a30c36e [inductor] remove deprecated cpp_builder implementation. (#133161)
I worked with @henrylhtsang to switch cpp_builder to the new implementation, and we have removed all dependencies on the old one.
So it is time to remove the old implementation; this PR makes that change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133161
Approved by: https://github.com/ezyang
2024-08-10 14:21:22 +00:00
Xu Han
2ad011ca73 [inductor] remove debug code of AotCodeCompiler (#132823)
Since we switched AotCodeCompiler to the new cpp_builder (https://github.com/pytorch/pytorch/pull/132766),
we can remove the debug code of AotCodeCompiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132823
Approved by: https://github.com/henrylhtsang
2024-08-10 08:04:48 +00:00
James Wu
f037803290 Add ChromiumEventLogger, log FXGraphCache and AOTAutogradCache (#132864)
This PR implements ChromiumEventLogger and hooks it into all @dynamo_timed events. For each dynamo_timed call, we log:
- A start event before starting the function execution
- An end event after finishing the function execution
- An extra pair of start/end events for any phase names included in dynamo.

Separately, this also gives us the ability to log instant events. I use them to log cache hits/misses as a first step. The little arrows on the bottom of the UI are cache hits/misses, and you can look at cache details by clicking each triangle.

The outputted chromium trace events can be viewed in perfetto for a timeline of an execution. Here's what it looks like for a run of nanogpt:
![image](https://github.com/user-attachments/assets/cb9e6c7a-1acf-45e6-8a27-6651d9ae6132)

And another with warm start:
![image](https://github.com/user-attachments/assets/cd9709bc-59ef-4da1-a7dd-10b1a0ab9b8f)

Trace events are based around the JSON Event format: https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview

We may want to switch to the less deprecated Protobuf format later, but so far I don't see any features we care about supported there.
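For reference, a tiny self-contained example of the JSON Event format used here; the field names follow that spec, the events themselves are illustrative:

```python
import json, os, time

def event(name, ph, **extra):
    # ph "B"/"E" = duration begin/end; "i" = instant (the cache-hit arrows).
    return {"name": name, "ph": ph, "pid": os.getpid(), "tid": 0,
            "ts": time.time_ns() // 1000, **extra}

trace = [
    event("dynamo", "B"),
    event("fx_graph_cache_hit", "i", s="p"),  # process-scoped instant event
    event("dynamo", "E"),
]
with open("trace.json", "w") as f:
    json.dump({"traceEvents": trace}, f)
# Open trace.json in Perfetto (ui.perfetto.dev) to view the timeline.
```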

Internal FB employees can see a link to this in the tlparse output:
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpVi1FIl/dedicated_log_torch_trace_bb4zl_bc.log/index.html

I'll also work on logging these

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132864
Approved by: https://github.com/aorenste
2024-08-10 01:15:53 +00:00
Henry Tsang
78cf8df4a0 [aoti] forward fix of [inductor] switch AotCodeCompiler to new cpp_builder. (take 3) (#133042)
Summary:
Forward fix of a test failure caused by D60773405.

The idea of D60773405 is that we need to use absolute paths, so we want to use the older version of the paths for output_so and output_o.

However, when I was copying the older definitions of output_so and output_o, I thought it was okay to simplify it a bit. See https://github.com/pytorch/pytorch/pull/131304#issuecomment-2270016609

Turns out I was wrong.

Test Plan: ci

Differential Revision: D60990594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133042
Approved by: https://github.com/hl475, https://github.com/desertfire
2024-08-09 18:53:27 +00:00
Danielmic
32f9a809c7 Replace [[unlikely]] with unlikely(x) (#130816)
Do not use `[[unlikely]]`, as it is a C++20 language feature; see https://en.cppreference.com/w/cpp/language/attributes/likely

Fixes https://github.com/pytorch/pytorch/issues/130815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130816
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/malfet
2024-08-07 10:38:13 +00:00
Henry Tsang
e98eac76b3 [inductor] switch AotCodeCompiler to new cpp_builder. (take 3) (#132766)
Summary: This is basically https://github.com/pytorch/pytorch/pull/131304 together with https://github.com/pytorch/pytorch/pull/132594 and absolute path fix for fbcode.

Test Plan: ci

Differential Revision: D60773405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132766
Approved by: https://github.com/xuhancn, https://github.com/chenyang78, https://github.com/desertfire
2024-08-06 23:56:34 +00:00
eellison
18b678082e [Easy] log output code path on cache hit (#132718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132718
Approved by: https://github.com/oulgen, https://github.com/masnesral
2024-08-06 21:59:30 +00:00
Gabriel Ferns
c3ee07c71c add missing profiler include in cpp code generation (#132419)
Summary:
When a user sets config.profiler_mark_wrapper_call, RECORD_FUNCTION annotations are added to the code. This requires importing the header <ATen/record_function.h>, but the conditional for doing so didn't check config.profiler_mark_wrapper_call.

Test Plan:
This case is already covered in test_profiler_mark_wrapper_call.

```
(pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (missing-profile-include)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k CpuTests.test_profiler_mark_wrapper_call_cpu
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 8.080s

OK
```

Fixes https://github.com/pytorch/pytorch/issues/131339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132419
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-05 13:40:47 +00:00
Oguz Ulgen
09f9c256ad Add basic mypy annotations to inductor (#132416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-04 18:43:37 +00:00
PyTorch MergeBot
f2ddd5e9e0 Revert "Add basic mypy annotations to inductor (#132416)"
This reverts commit 78927d37f6.

Reverted https://github.com/pytorch/pytorch/pull/132416 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))
2024-08-04 18:39:29 +00:00
Xuehai Pan
f7aeb394b6 [BE][Easy] Remove empty ISORT_SKIPLIST (#132572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132572
Approved by: https://github.com/ezyang, https://github.com/justinchuby
ghstack dependencies: #129769
2024-08-04 10:24:09 +00:00
Sam Larsen
b71cd149ce Fix file lock issue in AotCodeCompiler (#132343)
Summary:
It looks like there are several places in AotCodeCompiler that write files in a way that aren't safe for concurrency. There's a filelock to cope with that, but it seems like the lock path isn't quite robust enough to prevent races. We have an internal stress test failing when executing multiple concurrent versions of the test. It seems as though there's some variability in the content we write to the cpp file, which means we can get a different 'key' across different runs. The lock path includes that key in the lock path name, but the path for the "consts_path" is computed separately. Therefore, I see things like this:

- The computed 'key' is `cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z`
- The lock_path (based on the key) is: `/tmp/torchinductor_slarsen/locks/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.lock`
- The cpp path (which also includes the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.cpp`
- The consts_path (not based on the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cifbshkqkbsurzldsyi2vl5bsnhvejmavys4kktpwrzmpo4ysuoy.bin`

So we have different test instances using different lock paths but touching the same consts_path, thereby stomping on each other's consts_path. To fix, include the key in the consts_path.
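The shape of the fix, as a sketch; the paths and the filelock usage here are illustrative:

```python
import os
from filelock import FileLock

def artifact_paths(key: str, root: str = "/tmp/torchinductor_sketch"):
    # Derive *every* artifact path from the same key that names the lock,
    # so runs holding different locks can never touch the same consts file.
    os.makedirs(f"{root}/locks", exist_ok=True)
    return {
        "lock": f"{root}/locks/{key}.lock",
        "cpp": f"{root}/{key}.cpp",
        "consts": f"{root}/{key}.bin",  # now keyed too, unlike before the fix
    }

p = artifact_paths("cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z")
with FileLock(p["lock"]):
    pass  # write p["cpp"] and p["consts"] while holding the keyed lock
```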

Test Plan: Ran internal stress test. Repro'd failure and verified this change fixes it.

Differential Revision: D60552021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132343
Approved by: https://github.com/desertfire
2024-08-02 19:01:37 +00:00
Edward Z. Yang
290f09f829 Ban decorator usage of dynamo_timed (#132328)
This is a more manual version of https://github.com/pytorch/pytorch/pull/132073 that just manually creates the new function at each call site instead of magicking it with clone. Review with whitespace diffs off.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132328
Approved by: https://github.com/albanD
2024-08-02 12:00:46 +00:00
PyTorch MergeBot
c8958f8f84 Revert "Ban decorator usage of dynamo_timed (#132328)"
This reverts commit 9853c048eb.

Reverted https://github.com/pytorch/pytorch/pull/132328 on behalf of https://github.com/clee2000 due to seems to have broken functorch/test_aotdispatch.py::TestAOTAutograd::test_input_data_and_metadata_mutation_aliases_other_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10204547165/job/28233976446) [HUD commit link](9853c048eb).  Test passed on PR, probably a landrace, base is only 10 hours old ([comment](https://github.com/pytorch/pytorch/pull/132328#issuecomment-2263909337))
2024-08-01 20:20:28 +00:00
Oguz Ulgen
78927d37f6 Add basic mypy annotations to inductor (#132416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-01 20:14:25 +00:00
Edward Z. Yang
9853c048eb Ban decorator usage of dynamo_timed (#132328)
This is a more manual version of https://github.com/pytorch/pytorch/pull/132073 that just manually creates the new function at each call site instead of magicking it with clone. Review with whitespace diffs off.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132328
Approved by: https://github.com/albanD
2024-08-01 19:27:58 +00:00
eellison
f32ab3b9e3 Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's set has non-deterministic iteration order. We recently ran into an internal failure that did not fail consistently.

See repro here: P1453035092.

Now, with these changes, it fails consistently. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals.
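For illustration, a minimal insertion-ordered set backed by dict (insertion-ordered since Python 3.7); the real OrderedSet in PyTorch is more complete, this just shows the property being relied on:

```python
class OrderedSetSketch:
    def __init__(self, items=()):
        self._d = dict.fromkeys(items)
    def add(self, item):
        self._d[item] = None
    def __contains__(self, item):
        return item in self._d
    def __iter__(self):
        return iter(self._d)

s = OrderedSetSketch(["mul", "add", "relu"])
s.add("sigmoid")
# Iteration order is exactly insertion order on every run, unlike set(),
# whose order depends on hashes and varies across runs for strings.
assert list(s) == ["mul", "add", "relu", "sigmoid"]
```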

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-08-01 04:37:15 +00:00