Summary:
Optimize shape padding in the following ways:
- Add BFloat16 support for AMP training and Float16 support for inference
- Optimize the microbenchmark to avoid peak-memory issues, and profile the memory ops so the padding decision is more accurate
- Add a flag to turn padding of the N and M dims in `torch.bmm` on/off, since the `.contiguous()` memory copy it requires is expensive and can cause peak-memory issues in internal models (see the sketch after this list)
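As a rough illustration of the shape-padding idea (a hand-written sketch, not Inductor's actual implementation; `padded_bmm` and the multiple of 8 are assumptions for the example), zero-padding the K dimension leaves the output shape and values unchanged, whereas padding N or M would also require slicing the result back out, which is where an extra copy such as `.contiguous()` can come in:
```
import torch
import torch.nn.functional as F

def padded_bmm(a, b, multiple=8):
    # a: (B, M, K), b: (B, K, N); pad K up to a multiple of `multiple`
    k = a.shape[-1]
    pad = (multiple - k % multiple) % multiple
    if pad == 0:
        return torch.bmm(a, b)
    a_padded = F.pad(a, (0, pad))        # zero-pad last dim of a -> (B, M, K+pad)
    b_padded = F.pad(b, (0, 0, 0, pad))  # zero-pad K dim of b   -> (B, K+pad, N)
    # the zero rows/columns contribute nothing, so this equals torch.bmm(a, b)
    return torch.bmm(a_padded, b_padded)
```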
Test Plan: CI
Differential Revision: D41724868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90425
Approved by: https://github.com/jianyuh
- Adds `log_level` to aot's config
- Outputs log to `<graph_name>_<log_level>.log` in aot_torchinductor subfolder of the debug directory
- Modifies the Inductor debug context to use the graph name when naming the folder instead of the OS pid
- Adds a `TORCH_COMPILE_DEBUG` flag to enable it, as well as separate dynamo and inductor logs (a minimal usage sketch follows this list)
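A minimal usage sketch, assuming the flag is read from the environment before compilation (the toy function is only for illustration):
```
import os

# Turn on the combined debug output (plus separate dynamo and inductor
# logs) before compiling anything; the logs land under the debug
# directory as described above.
os.environ["TORCH_COMPILE_DEBUG"] = "1"

import torch

@torch.compile
def f(x):
    return torch.relu(x) + 1

f(torch.randn(8))
```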
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88987
Approved by: https://github.com/Chillee
Summary:
Importing torch.fb seemed like a good idea, but we don't always have torch.fb inside fbcode. Testing for torch.version.git_version is more reliable, since we'll never have a git_version inside fbcode, which is an hg repo.
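A sketch of the check this describes (the helper name is made up for the example):
```
import torch

def in_fbcode():
    # OSS builds record a git hash in torch.version.git_version;
    # fbcode builds come from an hg repo and do not.
    return not hasattr(torch.version, "git_version")
```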
Test Plan: `buck2 run mode/dev-nosan //caffe2/test/inductor:smoke`
Reviewed By: soumith, jansel
Differential Revision: D41777058
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90312
Approved by: https://github.com/soumith
This commit had an inconsistent internal land and PR merge. This caused merge conflicts that required reverting in both places, normalizing the internal commit stack, and then re-landing properly.
Original commit: #88384 (011452a2a1)
Inconsistent revert: #90018 (8566aa7c0b4bdca50bf85ca14705b4304de030b3)
Revert of the inconsistent revert to restore a healthy state (or re-land of the original commit): cf3c3f2280
Landing the correct, internally congruent revert of the original commit: (This PR) #90055 (TBD)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/90055
Approved by: https://github.com/DanilBaibak, https://github.com/malfet
`torch.compile` can be used either as a decorator or to optimize a model directly, for example:
```
import torch

@torch.compile
def foo(x):
    return torch.sin(x) + x.max()
```
or
```
mod = torch.nn.ReLU()
optimized_mod = torch.compile(mod, mode="max-autotune")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89607
Approved by: https://github.com/soumith
This PR adds an option `config.profiler_mark_wrapper_call` (disabled by default) to mark the duration of wrapper call in the PyTorch profiler. This makes it easy to identify the duration and start/end of each wrapper call in the profiler output.
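A sketch of how the option might be used, assuming it is exposed on `torch._inductor.config` as described (the toy model and input are illustrative):
```
import torch
import torch._inductor.config as inductor_config

# mark wrapper calls so they show up as ranges in the profiler trace
inductor_config.profiler_mark_wrapper_call = True

model = torch.compile(torch.nn.Linear(16, 16))
x = torch.randn(4, 16)

with torch.profiler.profile() as prof:
    model(x)

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```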
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89674
Approved by: https://github.com/jansel
`config.compile_threads` gets the number of compile threads via `min(32, os.cpu_count())`, but `os.cpu_count()` is the total number of CPU cores in the system, not the number available to the process. This causes compile-thread contention when fewer cores are available than `min(32, os.cpu_count())`, e.g. when the available cores are restricted with numactl or taskset, making compilation very slow. This PR uses `len(os.sched_getaffinity(0))`, which returns the number of available CPU cores, whenever `os.sched_getaffinity` is available.
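A sketch of the thread-count logic this change describes:
```
import os

# Prefer the CPUs this process is actually allowed to run on (respects
# numactl/taskset); fall back to the system total where sched_getaffinity
# is unavailable (e.g. macOS or Windows).
if hasattr(os, "sched_getaffinity"):
    available_cpus = len(os.sched_getaffinity(0))
else:
    available_cpus = os.cpu_count()

compile_threads = min(32, available_cpus)
```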
Pull Request resolved: https://github.com/pytorch/pytorch/pull/89377
Approved by: https://github.com/soumith
- Propagates origin fx nodes through inlining during lowering
- Concatenates op names into kernel name
- Adds config to cap the number of ops in the kernel name so they don't get too long
Caveats:
- The ordering in the name may not match the order in which the ops are executed in the kernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/88624
Approved by: https://github.com/anijain2305, https://github.com/jansel
Fixes https://github.com/pytorch/torchdynamo/issues/1599
Inductor performs aggressive fusion of ops during the lowering of Fx graph into IR nodes. Note that this fusion is different from the fusion that we typically discuss in the context of Inductor, which refers to the fusion of SchedulerNodes (way after lowering). This PR, instead, ensures that we don't accumulate too many ops in the IR node to begin with.
In the case of hf_t5_large backward graph, earlier we would generate a kernel with 100s of operators, causing that kernel to take ~350 seconds of compilation time. With this PR, we get it down from 350 seconds to 50 seconds.
Note that this could affect performance, though I doubt it will lead to a really large dip. In my toy examples, even if the lowering creates multiple IR nodes, later fusion still creates a single node when the fusion is simple.
I would like (1) test_torchinductor.py, (2) test_torchinductor_info.py, and (3) at least the HF models to be enabled in CI before merging this one.
@ngimel @jansel @Chillee
cc @jansel @lezcano @fdrocha @mlazos @soumith @voznesenskym @yanboliang @penguinwu
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87447
Approved by: https://github.com/jansel
To cooperate with other multithreading methods, this forces the process pool to use 'fork' even if others have set it differently. We require fork because otherwise an `if __name__ == "__main__"` guard is needed in the calling script, which we do not control as a library.
Furthermore, this adds code to clean up worker processes if the parent exits abnormally (e.g. a segfault). Previously we would leave live but inactive workers around.
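A minimal sketch of the approach, assuming a POSIX system where the 'fork' start method exists (the worker function is a stand-in for the real compile job, and the cleanup of orphaned workers described above is not shown):
```
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

# Force 'fork' so a library can create the pool without requiring the
# calling script to add an `if __name__ == "__main__":` guard.
ctx = multiprocessing.get_context("fork")
pool = ProcessPoolExecutor(max_workers=8, mp_context=ctx)

def fake_compile(source_code):
    # stand-in for compiling one Triton kernel
    return len(source_code)

future = pool.submit(fake_compile, "kernel source")
print(future.result())
pool.shutdown()
```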
cc @jansel @lezcano @fdrocha
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87411
Approved by: https://github.com/soumith, https://github.com/anijain2305
Fixes https://github.com/pytorch/pytorch/pull/87048 by saving the needed properties before fork.
Actually attempting to get CUDA to load in the workers is probably not desired: CUDA initialization takes on the order of seconds, and having multiple processes use the same device will slow things down.
This just moves the needed properties from the main trainer process to the workers.
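A sketch of the idea, under the assumption that a worker only needs the device capability (the names here are illustrative, not the actual Inductor internals):
```
import multiprocessing
from concurrent.futures import ProcessPoolExecutor

import torch

# Read the property once in the parent, where CUDA may already be
# initialized, and pass the plain tuple to the workers.
cached_capability = (
    torch.cuda.get_device_capability() if torch.cuda.is_available() else None
)

def worker_has_triton(capability):
    # decision made purely from the cached value; no torch.cuda call here,
    # so the forked worker never initializes CUDA
    return capability is not None and capability >= (7, 0)

ctx = multiprocessing.get_context("fork")
with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as pool:
    print(pool.submit(worker_has_triton, cached_capability).result())
```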
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87101
Approved by: https://github.com/soumith
https://github.com/pytorch/pytorch/pull/87032 seems to have an issue that breaks our benchmark script; it might have to do with the benchmark script also using subprocess.
Before this PR:
```
$ ./benchmarks/dynamo/torchbench.py --performance --inductor --raise --training --float16
...
Traceback (most recent call last):
File "/home/jansel/conda/envs/pytorch/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 239, in _worker_compile
kernel = TritonCodeCache.load(source_code)
File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 234, in load
mod = PyCodeCache.load(source_code)
File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 212, in load
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_jansel/ij/cij7smji4sw2a56i4yz45bjkrosd2sb2raqnxzsxxpg4kwzuo2ta.py", line 5, in <module>
from torch._inductor.triton_ops.autotune import reduction
File "/home/jansel/pytorch/torch/_inductor/triton_ops/__init__.py", line 3, in <module>
if has_triton():
File "/home/jansel/pytorch/torch/_inductor/utils.py", line 38, in has_triton
return triton is not None and torch.cuda.get_device_capability() >= (7, 0)
File "/home/jansel/pytorch/torch/cuda/__init__.py", line 368, in get_device_capability
prop = get_device_properties(device)
File "/home/jansel/pytorch/torch/cuda/__init__.py", line 382, in get_device_properties
_lazy_init() # will define _get_device_properties
File "/home/jansel/pytorch/torch/cuda/__init__.py", line 228, in _lazy_init
raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
```
cc @zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87048
Approved by: https://github.com/soumith
This patch significantly improves the parallel compilation performance for compiling Triton kernels
by using a ProcessPoolExecutor to create a persistent pool of compilation workers.
Previously, os.fork overhead and GIL contention limited the achieved parallelism. This patch replaces the worker threads with a pool of processes that do the raw compilation, and does the serial work for everything else on the main thread. That other work couldn't be parallelized anyway, since it is mostly in Python.
In cold-start situations, the time to get the worker processes started can be a significant portion of the total time. This patch starts the workers earlier so they are ready to perform compilation (see code comments) by the time dynamo gets to that point.
Just tested this on one example benchmark (tf_efficientnet_b0), but the results are significant, almost eliminating the difference between a warm and cold compilation.
```
39.613s - warm
41.290s - cold, this patch
2m53.197s - cold, single threaded
1m7.092s - cold, old setup n = 8 (its best config)
```
(cold compilation is done after running `rm -rf /tmp/torchinductor_$USER`).
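An illustrative sketch of the early-startup idea (not the actual code; `warm_pool`, `compile_async`, and the trivial warm-up job are assumptions for the example):
```
from concurrent.futures import ProcessPoolExecutor

_pool = ProcessPoolExecutor(max_workers=8)

def _noop():
    return None

def warm_pool():
    # Submitting a trivial job at startup makes the executor begin
    # forking its worker processes now, so they are already alive when
    # the first real compilation request arrives.
    _pool.submit(_noop)

def compile_async(source_code):
    # real code would compile `source_code` to a kernel in a worker
    return _pool.submit(len, source_code)
```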
Pull Request resolved: https://github.com/pytorch/pytorch/pull/87032
Approved by: https://github.com/soumith, https://github.com/jansel