pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Peter Bell	9bd6e93a04	[inductor] Add option to create parent directory for write_atomic (#124646 ) In #124640 I see the error ``` File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 887, in load compiled_graph = FxGraphCache._lookup_graph(key, example_inputs) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 776, in _lookup_graph write_atomic(artifact_path, graph.source_code) File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py", line 412, in write_atomic with tmp_path.open(write_mode) as f: File "/opt/conda/envs/py_3.10/lib/python3.10/pathlib.py", line 1119, in open return self._accessor.open(self, mode, buffering, encoding, errors, FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp02wlik2v/iu/.28383.139931139675904.tmp' ``` Which is fixed by creating the parent directory first. Since this is what you want to do in most cases, I add an argument to `write_atomic` to do so itself. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124646 Approved by: https://github.com/lezcano	2024-04-24 20:12:23 +00:00
Jason Ansel	0792ceab4b	[dynamo] Refactor into torch/_inductor/runtime/compile_tasks.py (#124681 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124681 Approved by: https://github.com/masnesral ghstack dependencies: #124592	2024-04-23 17:51:25 +00:00
Jason Ansel	5d45eb77f1	[inductor] Remove usage of device_interface from _inductor.runtime (#124592 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124592 Approved by: https://github.com/masnesral	2024-04-23 17:51:25 +00:00
Edward Z. Yang	0bbbc754dd	Add AOTInductor generated cpp code to TORCH_TRACE (#124617 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124617 Approved by: https://github.com/albanD	2024-04-22 19:25:20 +00:00
Jason Ansel	7fd8870e6b	[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553	2024-04-22 18:46:24 +00:00
PyTorch MergeBot	0b90af0bf5	Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 )" This reverts commit `fcf28b0ad5`. Reverted https://github.com/pytorch/pytorch/pull/124557 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))	2024-04-22 18:28:05 +00:00
Jason Ansel	fcf28b0ad5	[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557 ) I am planning to make the compile_worker process not import torch so it can start up much faster. This stack is prep for that. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557 Approved by: https://github.com/yanboliang ghstack dependencies: #124552, #124553	2024-04-22 04:51:15 +00:00
Oguz Ulgen	0d64b82f0b	Make CompiledFxGraph portable between machines (#124438 ) As we prepare FxGraphCache to move to remote, we need to make sure there's no data that is on the disk. Differential Revision: [D56363808](https://our.internmc.facebook.com/intern/diff/D56363808) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124438 Approved by: https://github.com/jansel	2024-04-20 05:26:14 +00:00
eellison	39fc280dce	Dont precompile already seen keys, limit epilogue choices (#122642 ) Two changes: - in epilogue benchmark fusion, only take top 6 choices. There were basically no choices taken after this in HF. - Share a single precompilation function among matmuls with same key. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642 Approved by: https://github.com/shunting314 ghstack dependencies: #124030	2024-04-19 17:34:22 +00:00
eellison	136f8378e1	Re-land precompile triton templates (#124030 ) Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030 Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu	2024-04-19 17:03:33 +00:00
Nikita Shulga	1ba85b34dd	[AOTI] Enable mmaped weights when CUDA is used (#124346 ) By refactoring the logic that returns the start to constant pointer into a `_get_constants_start()` method and calling it from both the CUDA and CPU readers. It has no runtime impact, but export time is down from 10m to 3m if mmaped weights are used on AWS p4d.24xlarge. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124346 Approved by: https://github.com/mikekgfb, https://github.com/desertfire	2024-04-19 04:47:27 +00:00
PyTorch MergeBot	2b82345e48	Revert "Re-land precompile triton templates (#124030 )" This reverts commit `030bb13fe8`. Reverted https://github.com/pytorch/pytorch/pull/124030 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2063191117))	2024-04-18 07:21:41 +00:00
Zhuoran Zhao	8ad66e05d2	[4/x][AMD][Lowering Enablement] Enabling meta internal AOTInductor compilation on ROCM (#124123 ) Summary: as title Test Plan: CI & unit test Differential Revision: D56163334 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124123 Approved by: https://github.com/chenyang78, https://github.com/jansel	2024-04-18 04:19:37 +00:00
eellison	030bb13fe8	Re-land precompile triton templates (#124030 ) Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030 Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu	2024-04-18 01:22:13 +00:00
Xuehai Pan	93e249969b	[BE] enable `ruff` rule `RSE` and remove useless parentheses in `raise` statements (#124261 ) Remove useless parentheses in `raise` statements if the exception type is raised with no argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261 Approved by: https://github.com/albanD	2024-04-17 19:29:34 +00:00
PyTorch MergeBot	3f89f565bb	Revert "Re-land precompile triton templates (#124030 )" This reverts commit `d68196e7ef`. Reverted https://github.com/pytorch/pytorch/pull/124030 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2061044960))	2024-04-17 11:31:33 +00:00
PyTorch MergeBot	77ad630f5d	Revert "Dont precompile already seen keys, limit epilogue choices (#122642 )" This reverts commit `050051f412`. Reverted https://github.com/pytorch/pytorch/pull/122642 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124030#issuecomment-2061044960))	2024-04-17 11:31:32 +00:00
eellison	050051f412	Dont precompile already seen keys, limit epilogue choices (#122642 ) Two changes: - in epilogue benchmark fusion, only take top 6 choices. There were basically no choices taken after this in HF. - Share a single precompilation function among matmuls with same key. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122642 Approved by: https://github.com/shunting314 ghstack dependencies: #124030	2024-04-17 03:08:59 +00:00
eellison	d68196e7ef	Re-land precompile triton templates (#124030 ) Re-land precompile triton templates. This got reverted because we were precompiling templates without checking the cache. I have since added logic and a test to ensure we do not precompile if there is a cache hit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124030 Approved by: https://github.com/shunting314, https://github.com/nmacchioni, https://github.com/yoyoyocmu	2024-04-17 02:30:46 +00:00
Oguz Ulgen	1fd9e320ea	Remove unnecessary FileLock in Fx Graph Cache (#124212 ) Writing to file happens via `write_atomic`, there's no need to take a global lock on the file system. This is likely creating unnecessary waits. Differential Revision: [D56208628](https://our.internmc.facebook.com/intern/diff/D56208628/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124212 Approved by: https://github.com/masnesral, https://github.com/eellison	2024-04-17 01:02:41 +00:00
Rohan	72271fb07e	Add NEON ISA support on aarch64 (#123584 ) Fixes #104729 This improves the compiled mode performance of Softmax (by 20%) and other operations (like batchnorm) that invoke the reduce_all function. Thereby also improves BERT inference by around 8%. Tested on a graviton 3 instance (c7g.4xl). Tests were run in a single-threaded manner. Script attached below. Command: `OMP_NUM_THREADS=1 LRU_CACHE_CAPACITY=1024 DNNL_DEFAULT_FPMATH_MODE=BF16 python TestSoftmax.py` [TestSoftmax.txt](https://github.com/pytorch/pytorch/files/14910754/TestSoftmax.txt) ```python import torch import torch.nn as nn from torch.profiler import profile, record_function, ProfilerActivity model = nn.Softmax().eval() compiled_model = torch.compile(model) inputs = torch.randn(1024, 1024) with torch.set_grad_enabled(False): for _ in range(50): compiled_model(inputs) #Warmup print("Warmup over") with profile(activities=[ProfilerActivity.CPU]) as prof: with record_function("model_inference"): for _ in range(100): compiled_model(inputs) print(prof.key_averages().table(sort_by="self_cpu_time_total")) # Check if the compiled model inference and the eager model inference are similar using torch.allclose print(torch.allclose(compiled_model(inputs), model(inputs))) ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123584 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-04-16 18:49:52 +00:00
Sam Larsen	6babf00014	[inductor] Bypass FX graph cache when we have HigherOrderOperators (#123325 ) Summary: The initial motivation was to avoid caching when we have triton higher order ops, but it's probably safer to avoid the cache for all higher order ops and allow/implement if/when we find it necessary. Test Plan: Unit test cribbed from: https://docs-preview.pytorch.org/pytorch/tutorials/2783/recipes/torch_compile_user_defined_triton_kernel_tutorial.html?highlight=triton Pull Request resolved: https://github.com/pytorch/pytorch/pull/123325 Approved by: https://github.com/eellison	2024-04-16 02:51:49 +00:00
Kai Londenberg	aaad0554b4	[Inductor] Fix endless recursion in codecache.DLLWrapper.__getattr__ (#123931 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123931 Approved by: https://github.com/peterbell10	2024-04-16 00:52:21 +00:00
Sam Larsen	e5b404b809	[inductor] Fix fresh_inductor_cache() (#122661 ) Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts. Test Plan: - New unit test - All existing inductor tests will exercise fresh_inductor_cache() Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661 Approved by: https://github.com/oulgen	2024-04-15 20:28:54 +00:00
Jason Ansel	285c93d64d	[inductor] Write generated files from parent process (#123409 ) Before this PR we would pass generated source code over a pipe to the compile worker then the compile worker would write out the file. Doing it this way is faster and results in smaller messages to the workers (and lets us skip creating the workers in the warm start case). Pull Request resolved: https://github.com/pytorch/pytorch/pull/123409 Approved by: https://github.com/desertfire	2024-04-13 06:31:28 +00:00
PyTorch MergeBot	d994d993c0	Revert "[inductor] Fix fresh_inductor_cache() (#122661 )" This reverts commit `cda383e7bc`. Reverted https://github.com/pytorch/pytorch/pull/122661 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122661#issuecomment-2051171028))	2024-04-12 07:26:50 +00:00
PyTorch MergeBot	e881d567f4	Revert "[inductor] Write generated files from parent process (#123409 )" This reverts commit `79c565b24e`. Reverted https://github.com/pytorch/pytorch/pull/123409 on behalf of https://github.com/DanilBaibak due to Needs to be reverted because it blocks reverting of the broken PR. ([comment](https://github.com/pytorch/pytorch/pull/123409#issuecomment-2051166617))	2024-04-12 07:23:57 +00:00
Jason Ansel	79c565b24e	[inductor] Write generated files from parent process (#123409 ) Before this PR we would pass generated source code over a pipe to the compile worker then the compile worker would write out the file. Doing it this way is faster and results in smaller messages to the workers (and lets us skip creating the workers in the warm start case). Pull Request resolved: https://github.com/pytorch/pytorch/pull/123409 Approved by: https://github.com/desertfire	2024-04-11 17:39:16 +00:00
Nikita Shulga	416f532753	[AOTI] Serialize large weights (#123002 ) By appending them to the end of the shared library and mmaping afterwards. Disabled by default, but overridable by `config.aot_inductor.force_mmap_weights`. Implemented by adding a `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h`, which is defined when weights are appended to the binary. In that case, the shared library name is determined by calling `dladdr`, then the library is mmaped and finally checked against a random magic number embedded at the end of the weights as well as in the const section of the library in question. Added unit tests to validate that it works as expected. TODO: - Extend support to CUDA - munmap region if the same library is reused Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb	2024-04-11 06:39:58 +00:00
chunyuan	0d0fd80033	[AOTI] fix relocation overflow error when .data is large (#123639 ) https://github.com/pytorch/pytorch/pull/123164 removed the below code (so that constants are not readonly) to support module buffer mutation: `a9a9ce6d9c/torch/_inductor/codecache.py (L1685-L1691)` However, it may cause relocation overflow when the `.data` section is large. Below is part of the output from `ld --verbose` (`GNU ld (GNU Binutils for Ubuntu) 2.38`). `.data` is in between `.text` and `.bss`. When `.data` is too large, during the linking, the relocation of `.text` against `.bss` may overflow. Rename it to `.ldata` (perhaps that's why previously `.lrodata` instead of `.rodata` is used) so that it won't be in between the `.text` and `.bss` section ``` .text .rodata .data .bss .lrodata .ldata ``` We met this issue when fixing https://github.com/pytorch/pytorch/issues/114450 and running the below models on CPU: - AlbertForMaskedLM - AlbertForQuestionAnswering - BlenderbotForCausalLM - DebertaV2ForMaskedLM - DebertaV2ForQuestionAnswering - XGLMForCausalLM Pull Request resolved: https://github.com/pytorch/pytorch/pull/123639 Approved by: https://github.com/jgong5, https://github.com/desertfire	2024-04-11 01:37:43 +00:00
Sam Larsen	cda383e7bc	[inductor] Fix fresh_inductor_cache() (#122661 ) Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts. Test Plan: - New unit test - All existing inductor tests will exercise fresh_inductor_cache() Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661 Approved by: https://github.com/oulgen	2024-04-10 20:38:56 +00:00
PyTorch MergeBot	a65e9a06f0	Revert "[AOTI] Serialize large weights (#123002 )" This reverts commit `27eb5daee4`. Reverted https://github.com/pytorch/pytorch/pull/123002 on behalf of https://github.com/DanilBaibak due to There is conflict to land the diff internally ([comment](https://github.com/pytorch/pytorch/pull/123002#issuecomment-2048215990))	2024-04-10 18:54:31 +00:00
Nikita Shulga	27eb5daee4	[AOTI] Serialize large weights (#123002 ) By appending them to the end of the shared library and mmaping afterwards. Disabled by default, but overridable by `config._force_mmap_aoti_weights`. Implemented by adding a `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h`, which is defined when weights are appended to the binary. In that case, the shared library name is determined by calling `dladdr`, then the library is mmaped and finally checked against a random magic number embedded at the end of the weights as well as in the const section of the library in question. Added unit tests to validate that it works as expected. TODO: - Extend support to CUDA - munmap region if the same library is reused Co-authored-by: Michael Gschwind <61328285+mikekgfb@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb	2024-04-09 22:18:57 +00:00
brothergomez	a96e4ad0d1	[Inductor] Pass device interface to the worker compile (#122492 ) Summary: In `codecache.py` pass the device_interface directly to `_worker_compile()` instead of calling `get_device_interface()` from inside the function. If the device_interface is registered by an out-of-tree module then it will only be registered inside the main process and not inside the worker process. This fixes this issue. Happy to add a test if required. Test plan: No tests added Co-authored-by: brothergomez <brothergomez@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122492 Approved by: https://github.com/ezyang	2024-04-09 21:23:33 +00:00
Jez Ng	1b9eebb6bb	[AOTI] Handle null outputs (#123460 ) Summary: I skipped over the codegen for output handle assignment if the outputs are null -- in addition to being redundant, it was causing compile errors. I also modified the runtime to do the necessary null checks. Fixes #123173. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123460 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-04-08 23:07:03 +00:00
PyTorch MergeBot	a808559fc6	Revert "[inductor] Fix fresh_inductor_cache() (#122661 )" This reverts commit `ba7d396eb7`. Reverted https://github.com/pytorch/pytorch/pull/122661 on behalf of https://github.com/clee2000 due to new test is failing internally ([comment](https://github.com/pytorch/pytorch/pull/122661#issuecomment-2037977934))	2024-04-04 18:55:55 +00:00
Sam Larsen	ba7d396eb7	[inductor] Fix fresh_inductor_cache() (#122661 ) Summary: Modify fresh_inductor_cache() to clear cached state before mocking the toplevel cache_dir directory. Any lru_caches (or otherwise) can use the @clear_on_fresh_inductor_cache decorator to register the cache for clearing. Also change the base inductor TestCase class to use fresh_inductor_cache(). Previously that TestCase was only mocking the subdirectory within the toplevel cache dir designated for the FX graph cache artifacts. Test Plan: - New unit test - All existing inductor tests will exercise fresh_inductor_cache() Pull Request resolved: https://github.com/pytorch/pytorch/pull/122661 Approved by: https://github.com/oulgen	2024-04-04 02:32:37 +00:00
Kai Londenberg	f2e67179ee	[Inductor] Make codecache CUDA compilation more robust & flexible (#121490 ) Minor changes which make the CUDA compilation within _inductor/codecache.py more robust and flexible. Test plan: CI Additional test in test_codecache.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/121490 Approved by: https://github.com/jansel	2024-04-03 12:56:48 +00:00
Bin Bao	0ff6155eee	[AOTI] Support module buffer mutation (#123164 ) Summary: Fixes https://github.com/pytorch/pytorch/issues/120424. Because in a forward pass module buffers may be mutated, we need to allow that in AOTI. In addition, this will be a necessary step if we want to extend AOTI to training. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123164 Approved by: https://github.com/digantdesai, https://github.com/malfet, https://github.com/chenyang78, https://github.com/khabinov	2024-04-02 20:25:26 +00:00
Jason Ansel	3a9eead4ab	[inductor] Don't compile MultiKernelCall in a subprocess (#123010 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123010 Approved by: https://github.com/shunting314 ghstack dependencies: #123009	2024-03-30 05:46:09 +00:00
Bin Bao	375a8041ed	[AOTI][refactor] Improve logging (#122932 ) Summary: Improve some logging msgs, and change a data type to remove a compile time warning. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122932 Approved by: https://github.com/chenyang78	2024-03-29 14:02:23 +00:00
William Wen	2564f6cf0e	[dynamo, 3.12] Allocate Dynamo shadow frames by mimicking CPython (#122146 ) Python 3.12 changed a few things with how `_PyInterpreterFrame`s are allocated and freed: - Frames are now required to be placed on the Python frame stack. In 3.11, we could allocate frames anywhere in memory. In 3.12, we now need to use `THP_PyThreadState_BumpFramePointerSlow`/`push_chunk`/`allocate_chunk`. This method of allocating/freeing frames is also compatible with 3.11. - The eval frame function is now responsible for clearing the frame (see https://docs.python.org/3/whatsnew/changelog.html#id128, the point about "...which now clear the frame.") Pull Request resolved: https://github.com/pytorch/pytorch/pull/122146 Approved by: https://github.com/jansel	2024-03-27 20:39:39 +00:00
Bin Bao	537cd66e73	[Inductor] Support custom op in JIT with cpp wrapper (#122554 ) Summary: To call custom ops in an ABI-compatible way requires doing boxed call with varargs across C shim. In the JIT mode, we can get around it by calling into Python. https://gist.github.com/desertfire/be2a65b0a9b47780bb716b53ac2cd2b3 is an example of generated code. Differential Revision: [D55326556](https://our.internmc.facebook.com/intern/diff/D55326556) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122554 Approved by: https://github.com/jansel, https://github.com/chenyang78	2024-03-26 18:48:45 +00:00
Nikita Shulga	dd3f2cb53a	[Inductor] Add NEON ISA support on arm64 Macs (#122217 ) This started as a re-land of https://github.com/pytorch/pytorch/pull/105590 but focusing on enabling it on MacOS, but quickly turned into landing very limited platform-specific acceleration at this time (I.e. this PR does not add any NEON accelerated code at all, just enables vectorized compilation for the existing abstractions) Enabling the test harness, uncovered number of latent issues in CPU inductor that were fixed in the following PRS: - https://github.com/pytorch/pytorch/pull/122511 - https://github.com/pytorch/pytorch/pull/122513 - https://github.com/pytorch/pytorch/pull/122580 - https://github.com/pytorch/pytorch/pull/122608 Following was added/changed to enable vectorization code to work on MacOS - Added VecNEON class to `_inductor/codecache.py` that is supported on all AppleSilicon Macs - Added `Vectorized::loadu_one_fourth` to `vec_base.h`, and limit it to 8-bit types - Change 64-bit integral types mapping to `int64_t`/`uint64_t` to align with the rest of the code, as on MacOS, `int64_t` is a `long long` rather than `long` (see https://github.com/pytorch/pytorch/pull/118149 for more details) See table below for perf changes with and without torch.compile using [gpt-fast](https://github.com/pytorch-labs/gpt-fast) running `stories15M` on M2 Pro: \| dtype \| Eager \| Compile (before) \| Compile (after) \| \| ------ \| ------ \| --------- \| --------- \| \| bfloat16 \| 120 tokens/sec \| 130 tokens/sec \| 156 tokens/sec \| \| float32 \| 158 tokens/sec \| 140 tokens/sec \| 236 tokens/sec \| \| float16 \| 235 tokens/sec \| 81 tokens/sec \| 58 tokens/sec \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/122217 Approved by: https://github.com/jansel	2024-03-26 05:07:30 +00:00
Nikita Shulga	4758837930	[BE] Do not use `importlib.load_module` (#122542 ) To get rid of the annoying ``` <frozen importlib._bootstrap>:283: DeprecationWarning: the load_module() method is deprecated and slated for removal in Python 3.12; use exec_module() instead ``` using recipe from https://docs.python.org/3/library/importlib.html#importing-a-source-file-directly Pull Request resolved: https://github.com/pytorch/pytorch/pull/122542 Approved by: https://github.com/jansel, https://github.com/desertfire	2024-03-23 17:22:26 +00:00
PyTorch MergeBot	3795ebe925	Revert "[Inductor] Make codecache CUDA compilation more robust & flexible (#121490 )" This reverts commit `6bbd697306`. Reverted https://github.com/pytorch/pytorch/pull/121490 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing on ROCm, i.e. `700c92e1b9` ([comment](https://github.com/pytorch/pytorch/pull/121490#issuecomment-2015829464))	2024-03-22 20:11:47 +00:00
Kai Londenberg	6bbd697306	[Inductor] Make codecache CUDA compilation more robust & flexible (#121490 ) Minor changes which make the CUDA compilation within _inductor/codecache.py more robust and flexible. Test plan: CI Additional test in test_codecache.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/121490 Approved by: https://github.com/jansel	2024-03-22 08:12:11 +00:00
Bert Maher	ea6f67853e	[inductor fbcode] Add python include paths for Python.h (#122363 ) Summary: We're getting errors that Python.h is not found because we didn't have the proper include path set up for it. bypass-github-export-checks Test Plan: I can only get this to show up in Bento: N5106134 Reviewed By: hl475, chenyang78 Differential Revision: D55133110 Co-authored-by: Bert Maher <bertrand@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122363 Approved by: https://github.com/bertmaher	2024-03-21 04:32:17 +00:00
eellison	cbbed46377	Defer selection of triton template (#120275 ) Our prior approach to epilogue fusion was to select from a choice from a set of triton templates and extern calls based on benchmarking inputs, then unconditionally fuse epilogues. This can be sub-optimal in following ways: - We select an extern kernel, however an epilogue like relu() exists such that choosing a triton template + relu would have been faster - We select a triton template, epilogue fuse, and register spilling occurs causing it to be slower than not epilogue fusing. In this PR we wait to select either the Triton Template or Extern Kernel based on benchmarking results from the kernel itself and its epilogue. As soon as a successful fusion occurs where a fused Triton Template + epilogue is faster than the unfused choice we finalize the MultiTemplateBuffer as a specific template. If no fusion occurs we'll finalize the MultiTemplateBuffer after fusion. Note: if there are multiple epilogue fusions (not super likely), even though we select a template after the first fusion, we will still benchmark to see if subsequent epilogue are worth fusing. We could potentially defer choosing template in this case in a follow up at expense of compile time. Gives 4% HF training win, 10% TIMM inference win. Increases compilation time which I will be trying to address more in follow up prs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120275 Approved by: https://github.com/jansel ghstack dependencies: #121996	2024-03-20 01:40:33 +00:00
Han, Xu	09ce76809c	Improve compiler detection on MacOS (#121406 ) By relying on `is_apple_clang` helper function rather than on compiler name (as `gcc` is clang on MacOS): ``` % which gcc; gcc -v /usr/bin/gcc Apple clang version 15.0.0 (clang-1500.3.9.4) Target: arm64-apple-darwin23.3.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin ``` But ``` % /opt/homebrew/bin/gcc-13 -v Using built-in specs. COLLECT_GCC=/opt/homebrew/bin/gcc-13 COLLECT_LTO_WRAPPER=/opt/homebrew/Cellar/gcc/13.2.0/bin/../libexec/gcc/aarch64-apple-darwin23/13/lto-wrapper Target: aarch64-apple-darwin23 Configured with: ../configure --prefix=/opt/homebrew/opt/gcc --libdir=/opt/homebrew/opt/gcc/lib/gcc/current --disable-nls --enable-checking=release --with-gcc-major-version-only --enable-languages=c,c++,objc,obj-c++,fortran --program-suffix=-13 --with-gmp=/opt/homebrew/opt/gmp --with-mpfr=/opt/homebrew/opt/mpfr --with-mpc=/opt/homebrew/opt/libmpc --with-isl=/opt/homebrew/opt/isl --with-zstd=/opt/homebrew/opt/zstd --with-pkgversion='Homebrew GCC 13.2.0' --with-bugurl=https://github.com/Homebrew/homebrew-core/issues --with-system-zlib --build=aarch64-apple-darwin23 --with-sysroot=/Library/Developer/CommandLineTools/SDKs/MacOSX14.sdk Thread model: posix Supported LTO compression algorithms: zlib zstd gcc version 13.2.0 (Homebrew GCC 13.2.0) ``` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121406 Approved by: https://github.com/malfet, https://github.com/jansel	2024-03-19 05:32:08 +00:00
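
The `write_atomic` change described in commit 9bd6e93a04 at the top of this log (create the parent directory, write to a temporary file, then move it into place) can be illustrated with a minimal sketch. This is not the code from `torch/_inductor/codecache.py`: the function name `write_atomic_sketch` and the `make_dirs` parameter are hypothetical names chosen for illustration, and the temporary-file naming only mimics the `.<pid>.<thread id>.tmp` pattern visible in the traceback.

```python
import os
import tempfile
import threading
from pathlib import Path
from typing import Union


def write_atomic_sketch(
    path: Union[str, Path],
    content: Union[str, bytes],
    make_dirs: bool = False,  # hypothetical flag name, for illustration only
) -> None:
    """Write content to path via a temporary file in the same directory."""
    path = Path(path)
    if make_dirs:
        # Without this step, opening the temporary file below raises
        # FileNotFoundError when the parent directory does not exist.
        path.parent.mkdir(parents=True, exist_ok=True)
    # Per-process, per-thread temporary name next to the final target.
    tmp_path = path.parent / f".{os.getpid()}.{threading.get_ident()}.tmp"
    write_mode = "w" if isinstance(content, str) else "wb"
    with tmp_path.open(write_mode) as f:
        f.write(content)
    # os.replace renames the file into place, overwriting any existing target.
    os.replace(tmp_path, path)


# Usage: the "iu" subdirectory does not exist until make_dirs=True creates it.
with tempfile.TemporaryDirectory() as cache_dir:
    artifact = Path(cache_dir) / "iu" / "artifact.py"
    write_atomic_sketch(artifact, "x = 1\n", make_dirs=True)
    print(artifact.read_text())
```

Writing the temporary file into the same directory as the target keeps the final rename on one filesystem, which is what lets readers see either the old file or the complete new one.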

273 Commits