Summary:
Optimize shape padding in the following ways:
- Add BFloat16 support for AMP training and Float16 support for inference
- Optimize the microbenchmark to avoid peak-memory issues, and profile the memory ops so the padding decision is more accurate
- Add a flag to turn padding of the N and M dims in `torch.bmm` on/off, since the `.contiguous()` memory copy it requires is expensive and can cause peak-memory issues in internal models (see the sketch after this list)
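As a rough illustration of the shape-padding idea (a hand-written sketch, not Inductor's actual implementation; `padded_bmm` and the multiple of 8 are assumptions for the example), zero-padding the K dimension leaves the output shape and values unchanged, whereas padding N or M would also require slicing the result back out, which is where an extra copy such as `.contiguous()` can come in:
```
import torch
import torch.nn.functional as F

def padded_bmm(a, b, multiple=8):
    # a: (B, M, K), b: (B, K, N); pad K up to a multiple of `multiple`
    k = a.shape[-1]
    pad = (multiple - k % multiple) % multiple
    if pad == 0:
        return torch.bmm(a, b)
    a_padded = F.pad(a, (0, pad))        # zero-pad last dim of a -> (B, M, K+pad)
    b_padded = F.pad(b, (0, 0, 0, pad))  # zero-pad K dim of b   -> (B, K+pad, N)
    # the zero rows/columns contribute nothing, so this equals torch.bmm(a, b)
    return torch.bmm(a_padded, b_padded)
```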
Test Plan: CI
Differential Revision: D41724868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90425
Approved by: https://github.com/jianyuh
- Adds `log_level` to aot's config
- Outputs log to `<graph_name>_<log_level>.log` in aot_torchinductor subfolder of the debug directory
- Modifies the Inductor debug context to use the graph name when naming the folder instead of the OS pid
- Adds a `TORCH_COMPILE_DEBUG` flag to enable it, as well as separate dynamo and inductor logs (a minimal usage sketch follows this list)
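A minimal usage sketch, assuming the flag is read from the environment before compilation (the toy function is only for illustration):
```
import os

# Turn on the combined debug output (plus separate dynamo and inductor
# logs) before compiling anything; the logs land under the debug
# directory as described above.
os.environ["TORCH_COMPILE_DEBUG"] = "1"

import torch

@torch.compile
def f(x):
    return torch.relu(x) + 1

f(torch.randn(8))
```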
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88987
Approved by: https://github.com/Chillee
Summary:
Importing torch.fb seemed like a good idea, but we don't always have torch.fb inside fbcode. Testing for torch.version.git_version is more reliable, since we'll never have a git_version inside fbcode, which is an hg repo.
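A sketch of the check this describes (the helper name is made up for the example):
```
import torch

def in_fbcode():
    # OSS builds record a git hash in torch.version.git_version;
    # fbcode builds come from an hg repo and do not.
    return not hasattr(torch.version, "git_version")
```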
Test Plan: `buck2 run mode/dev-nosan //caffe2/test/inductor:smoke`
Reviewed By: soumith, jansel
Differential Revision: D41777058
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90312
Approved by: https://github.com/soumith
This commit had an inconsistent internal land and PR merge. This caused merge conflicts that required reverting in both places, normalizing the internal commit stack, and then re-landing properly.
Original commit: #88384 (011452a2a1)
Inconsistent revert: #90018 (8566aa7c0b4bdca50bf85ca14705b4304de030b3)
Revert of the inconsistent revert to restore a healthy state (or re-land of the original commit): cf3c3f2280
Landing the correct, internally congruent revert of the original commit: (This PR) #90055 (TBD)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90055
Approved by: https://github.com/DanilBaibak, https://github.com/malfet
`torch.compile` can be used either as a decorator or to optimize a model directly, for example:
```
import torch

@torch.compile
def foo(x):
    return torch.sin(x) + x.max()
```
or
```
mod = torch.nn.ReLU()
optimized_mod = torch.compile(mod, mode="max-autotune")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89607
Approved by: https://github.com/soumith
This PR adds an option `config.profiler_mark_wrapper_call` (disabled by default) to mark the duration of wrapper call in the PyTorch profiler. This makes it easy to identify the duration and start/end of each wrapper call in the profiler output.
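A sketch of how the option might be used, assuming it is exposed on `torch._inductor.config` as described (the toy model and input are illustrative):
```
import torch
import torch._inductor.config as inductor_config

# mark wrapper calls so they show up as ranges in the profiler trace
inductor_config.profiler_mark_wrapper_call = True

model = torch.compile(torch.nn.Linear(16, 16))
x = torch.randn(4, 16)

with torch.profiler.profile() as prof:
    model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```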
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89674
Approved by: https://github.com/jansel
`config.compile_threads` gets the number of compile threads via `min(32, os.cpu_count())`, but `os.cpu_count()` is the total number of CPU cores in the system, not the number available to the process. This causes compile-thread contention when fewer cores are available than `min(32, os.cpu_count())`, e.g. when the available cores are restricted with numactl or taskset, making compilation very slow. This PR uses `len(os.sched_getaffinity(0))`, which returns the number of available CPU cores, whenever `os.sched_getaffinity` is available.
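A sketch of the thread-count logic this change describes:
```
import os

# Prefer the CPUs this process is actually allowed to run on (respects
# numactl/taskset); fall back to the system total where sched_getaffinity
# is unavailable (e.g. macOS or Windows).
if hasattr(os, "sched_getaffinity"):
    available_cpus = len(os.sched_getaffinity(0))
else:
    available_cpus = os.cpu_count()

compile_threads = min(32, available_cpus)
```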
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89377
Approved by: https://github.com/soumith
- Propagates origin fx nodes through inlining during lowering
- Concatenates op names into kernel name
- Adds config to cap the number of ops in the kernel name so they don't get too long
Caveats:
- The ordering in the name may not match the order in which the ops are executed in the kernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
Fixes https://github.com/pytorch/torchdynamo/issues/1599
Inductor performs aggressive fusion of ops during the lowering of Fx graph into IR nodes. Note that this fusion is different from the fusion that we typically discuss in the context of Inductor, which refers to the fusion of SchedulerNodes (way after lowering). This PR, instead, ensures that we don't accumulate too many ops in the IR node to begin with.
In the case of hf_t5_large backward graph, earlier we would generate a kernel with 100s of operators, causing that kernel to take ~350 seconds of compilation time. With this PR, we get it down from 350 seconds to 50 seconds.
Note that this could affect performance, though I doubt it will lead to a really large dip. In my toy examples, even if the lowering creates multiple IR nodes, later fusion still creates a single node when the fusion is simple.
I would like (1) test_torchinductor.py, (2) test_torchinductor_info.py, and (3) at least the HF models to be enabled in CI before merging this one.
@ngimel @jansel @Chillee
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87447
Approved by: https://github.com/jansel
To cooperate with other multithreading methods, this forces the process pool to use 'fork' even if others have set it differently. We require fork because otherwise an `if __name__ == "__main__"` guard is needed in the calling script, which we do not control as a library.
Furthermore, this adds code to clean up worker processes if the parent exits abnormally (e.g. a segfault). Previously we would leave live but inactive workers around.
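A minimal sketch of the approach, assuming a POSIX system where the 'fork' start method exists (the worker function is a stand-in for the real compile job, and the cleanup of orphaned workers described above is not shown):
```
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

# Force 'fork' so a library can create the pool without requiring the
# calling script to add an `if __name__ == "__main__":` guard.
ctx = multiprocessing.get_context("fork")
pool = ProcessPoolExecutor(max_workers=8, mp_context=ctx)

def fake_compile(source_code):
    # stand-in for compiling one Triton kernel
    return len(source_code)

future = pool.submit(fake_compile, "kernel source")
print(future.result())
pool.shutdown()
```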
cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87411
Approved by: https://github.com/soumith, https://github.com/anijain2305
Fixes https://github.com/pytorch/pytorch/pull/87048 by saving the needed properties before fork.
Actually attempting to get CUDA to load in the workers is probably not desired: CUDA initialization takes on the order of seconds, and having multiple processes use the same device will slow things down.
This just moves the needed properties from the main trainer process to the workers.
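A sketch of the idea, under the assumption that a worker only needs the device capability (the names here are illustrative, not the actual Inductor internals):
```
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import torch

# Read the property once in the parent, where CUDA may already be
# initialized, and pass the plain tuple to the workers.
cached_capability = (
    torch.cuda.get_device_capability() if torch.cuda.is_available() else None
)

def worker_has_triton(capability):
    # decision made purely from the cached value; no torch.cuda call here,
    # so the forked worker never initializes CUDA
    return capability is not None and capability >= (7, 0)

ctx = multiprocessing.get_context("fork")
with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
    print(pool.submit(worker_has_triton, cached_capability).result())
```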
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87101
Approved by: https://github.com/soumith
https://github.com/pytorch/pytorch/pull/87032 seems to have an issue that breaks our benchmark script; it might have to do with the benchmark script also using subprocess.
Before this PR:
```
$ ./benchmarks/dynamo/torchbench.py --performance --inductor --raise --training --float16
...
Traceback (most recent call last):
File "/home/jansel/conda/envs/pytorch/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 239, in _worker_compile
kernel = TritonCodeCache.load(source_code)
File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 234, in load
mod = PyCodeCache.load(source_code)
File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 212, in load
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_jansel/ij/cij7smji4sw2a56i4yz45bjkrosd2sb2raqnxzsxxpg4kwzuo2ta.py", line 5, in <module>
from torch._inductor.triton_ops.autotune import reduction
File "/home/jansel/pytorch/torch/_inductor/triton_ops/__init__.py", line 3, in <module>
if has_triton():
File "/home/jansel/pytorch/torch/_inductor/utils.py", line 38, in has_triton
return triton is not None and torch.cuda.get_device_capability() >= (7, 0)
File "/home/jansel/pytorch/torch/cuda/__init__.py", line 368, in get_device_capability
prop = get_device_properties(device)
File "/home/jansel/pytorch/torch/cuda/__init__.py", line 382, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/home/jansel/pytorch/torch/cuda/__init__.py", line 228, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
cc @zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87048
Approved by: https://github.com/soumith
This patch significantly improves the parallel compilation performance for compiling Triton kernels
by using a ProcessPoolExecutor to create a persistent pool of compilation workers.
Previously, os.fork overhead and GIL contention limited the achieved parallelism. This patch replaces the worker threads with a pool of processes that do the raw compilation, and does the serial work for everything else on the main thread. That other work couldn't be parallelized anyway, since it is mostly in Python.
In cold-start situations, the time to get the worker processes started can be a significant portion of the total time. This patch starts the workers earlier so they are ready to perform compilation (see code comments) by the time dynamo gets to that point.
Just tested this on one example benchmark (tf_efficientnet_b0), but the results are significant, almost eliminating the difference between a warm and cold compilation.
```
39.613s - warm
41.290s - cold, this patch
2m53.197s - cold, single threaded
1m7.092s - cold, old setup n = 8 (its best config)
```
(cold compilation is done after running `rm -rf /tmp/torchinductor_$USER`).
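An illustrative sketch of the early-startup idea (not the actual code; `warm_pool`, `compile_async`, and the trivial warm-up job are assumptions for the example):
```
from concurrent.futures import ProcessPoolExecutor

_pool = ProcessPoolExecutor(max_workers=8)

def _noop():
    return None

def warm_pool():
    # Submitting a trivial job at startup makes the executor begin
    # forking its worker processes now, so they are already alive when
    # the first real compilation request arrives.
    _pool.submit(_noop)

def compile_async(source_code):
    # real code would compile `source_code` to a kernel in a worker
    return _pool.submit(len, source_code)
```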
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87032
Approved by: https://github.com/soumith, https://github.com/jansel