During distributed training if all ranks except one hit the cache, the rank that did not hit the cache will cause a NCCL timeout since rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase timeout for the ranks that hit the cache by the amount of time the cache would save.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374
Approved by: https://github.com/ezyang
ghstack dependencies: #133362, #133363
I have worked with @henrylhtsang to switch the cpp_builder to new one. We have removed the dependency to the old implementation.
So, it is time to remove the old implementation now. This PR is done the change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133161
Approved by: https://github.com/ezyang
Summary:
When a user sets config.profiler_mark_wrapper_call, RECORD_FUNCTION annotations are added to the code. This requires importing the header <ATen/record_function.h>, but the conditional for doing so didn't check
config.profiler_mark_wrapper_call.
Test Plan:
This case is already covered in test_profiler_mark_wrapper_call.
```
(pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (missing-profile-include)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k CpuTests.test_profiler_mark_wrapper_call_cpu
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 8.080s
OK
```
Fixes https://github.com/pytorch/pytorch/issues/131339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132419
Approved by: https://github.com/jgong5, https://github.com/desertfire
Summary:
It looks like there are several places in AotCodeCompiler that write files in a way that aren't safe for concurrency. There's a filelock to cope with that, but it seems like the lock path isn't quite robust enough to prevent races. We have an internal stress test failing when executing multiple concurrent versions of the test. It seems as though there's some variability in the content we write to the cpp file, which means we can get a different 'key' across different runs. The lock path includes that key in the lock path name, but the path for the "consts_path" is computed separately. Therefore, I see things like this:
- The computed 'key' is `cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z`
- The lock_path (based on the key) is: `/tmp/torchinductor_slarsen/locks/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.lock`
- The cpp path is (also includes the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.cpp`
- The consts_path (not based on the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cifbshkqkbsurzldsyi2vl5bsnhvejmavys4kktpwrzmpo4ysuoy.bin`
So we have different test instances using different lock paths, but touching the same consts_path and therefore stomping on each others' consts_path. To fix, include the key in the consts_paths.
Test Plan: Ran internal stress test. Repro'd failure and verified this change fixes it.
Differential Revision: D60552021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132343
Approved by: https://github.com/desertfire
Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail.
See, repro here: P1453035092.
Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
This PR mostly refactors by putting code into utils files so that they can be shared between codecache.py and compile_fx.py. Afterwards, it then changes compile_fx so that:
- When saving to FXGraphCache, we save onto the CompiledFXGraph all the necessary metadata for running post compile steps (realigning inputs, cudagraphification).
- When loading from FXGraphCache, we use the saved information directly, instead of calculating them from scratch.
What this does is make it so that `FXGraphCache.load()` is a perfect cache on compile_fx_inner, in that it **returns exactly what compile_fx_inner returns**. This also makes it possible for AOTAutogradCache, given a key to the fx graph cache and example inputs, to get back the full return value of compile_fx_inner.
## What's a post compile step?
We define a **post-compile** to be the set of actions that need to run after FXGraphCache either loads from the cache or misses and runs compilation. These steps include:
- Setting the tracing context's output strides
- Running cudagraphs if enabled
- Maybe realign inputs if cudagraphs didn't run
To run these steps, we save all the necessary metadata in CompiledFxGraph, and use them on a cache hit to reconstruct the object.
## Splitting cudagraphs work into pre/post compile
Cudagraphs does a lot of work on the input graph module to determine if cudagraphs can be enabled. This is the code that involves cudagraph_tests and stack traces. This will work in a world where we have access to the input graph module, but with AOTAutograd warm start, we won't have access to that information anymore. Therefore we can split cudagraphs work into two parts: on a cache miss (and therefore a full compile), we do the cudagraphs testing work, and save cudagraph_fail_reasons into the cache. Then on a cache hit, we know whether or not we can run cudagraphs, and if we can't, we can emit the correct error messages.
Implementation notes:
- We save `fx_kwargs` directly onto the CompiledFXGraph. `fx_kwargs` is already, by definition, part of the cache key, so this is safe to do when it comes to cache correctness.
- ^ Why do we do above even though FXGraphCache.load takes fx_kwargs as an argument? Because AOTAutogradCache **doesn't** have access to fx_kwargs: they're annoyingly encoded in the functools.partial() of the fw_compiler, so *only* inductor knows about these options. They're fully captured by the AOTAutogradCache key (since every key to fx_kwargs is either a global config, or a field that's deterministic based on an input graph module), but their values are still needed to run cudagraphs/postprocessing. Therefore, it's easier/safer to store it on the cached result.
- Willing to hear other approaches here if we think saving these extra fields is not reasonable, though I can't think of another way to do this that's less complicated to explain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130572
Approved by: https://github.com/eellison
Python's set is non deterministic. There is an internal failure which we recently ran into which did not consistently fail.
See, repro here: P1453035092.
Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, due to I can't debug anymore on no Meta internal environment access.
3. Add `TODO` comments for further some Meta employee help on contine to do this work.
4. Due to item 3, we only remaining `deprecated_cpp_compile_command` for `fb_code` to be fix, let's remove `validate_new_cpp_commands`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary: Found this "cannot find -ltorch: No such file or directory" issue when collecting AOTI CPU perf for the dashboard. Debugging on the CI machine revealed two problems: 1) no valid VEC_ISA was picked; 2) when 1 happens, libtorch path is not specified in the linker path.
This PR fixes the second problem. A later PR will fix the first problem, but somehow finding the right VEC_ISA causes a performance regression, which needs more investigation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131791
Approved by: https://github.com/zou3519, https://github.com/chenyang78
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, due to I can't debug anymore on no Meta internal environment access.
3. Add `TODO` comments for further some Meta employee help on contine to do this work.
4. Due to item 3, we only remaining `deprecated_cpp_compile_command` for `fb_code` to be fix, let's remove `validate_new_cpp_commands`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, due to I can't debug anymore on no Meta internal environment access.
3. Add `TODO` comments for further some Meta employee help on contine to do this work.
4. Due to item 3, we only remaining `deprecated_cpp_compile_command` for `fb_code` to be fix, let's remove `validate_new_cpp_commands`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
Now that remote caching has evolved into various parts of PT2, we want to separate triton and pt2 caching as changes to one have caused SEVs to the other.
Differential Revision: D60047752
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131345
Approved by: https://github.com/aorenste