pytorch/torch/_inductor/codegen
Elias Ellison f18f0c70ab Dont clone unmutated args in triton autotuning (#89519)
Improves first-run memory compression on pytorch_struct from .55 to .73. However, this doesn't totally eliminate the overhead from autotuning; any other pointers on where the remaining overhead comes from would be great.
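For context, a minimal sketch of the idea (the helper name and call site below are hypothetical illustrations, not the actual inductor code): before each autotuning benchmark run, defensively copy only the arguments the kernel mutates, instead of cloning every argument, so large read-only inputs are not duplicated.

```python
import torch

def clone_mutated_args(args, mutated_indices):
    """Clone only the tensors the kernel writes to, so each benchmark run
    sees fresh data without copying every (possibly large) argument."""
    cloned = list(args)
    for i in mutated_indices:
        if isinstance(cloned[i], torch.Tensor):
            cloned[i] = cloned[i].clone()
    return cloned

# Hypothetical usage: only `out` (index 2) is written by the kernel,
# so only it needs a defensive copy before benchmarking.
a = torch.randn(1024)
b = torch.randn(1024)
out = torch.empty(1024)
bench_args = clone_mutated_args((a, b, out), mutated_indices=[2])
```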

Edit: I think it's just the Triton cache clearing in 44f577984d/python/triton/testing.py (L159).
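That line refers to the pattern in Triton's `do_bench`, which clears the GPU L2 cache between timed runs by zeroing a large scratch buffer. A paraphrased sketch of that pattern (buffer size and names recalled from memory and may not match that revision exactly; requires a CUDA device):

```python
import torch

# Large scratch buffer; zeroing it evicts prior data from the GPU's L2 cache.
cache = torch.empty(int(256e6), dtype=torch.int8, device="cuda")

def bench_once(kernel_call):
    cache.zero_()   # flush L2 so the timed run starts with a cold cache
    kernel_call()   # the kernel launch being timed
```

Clearing the cache makes each measurement start cold, which helps timing fidelity but adds per-run cost during autotuning.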

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89519
Approved by: https://github.com/ngimel, https://github.com/jansel
2022-11-23 22:00:03 +00:00
__init__.py
autotuner.py
common.py [inductor] Fix nan handling for aten.sign (#88937) 2022-11-21 20:56:40 +00:00
cpp_prefix.h Support masked_fill to address the GPT2 performance issue (#89274) 2022-11-22 04:12:43 +00:00
cpp.py [inductor] generate nan in the cpp backend (#89289) 2022-11-22 15:54:04 +00:00
triton_conv_delta_x_hwc.j2
triton_conv_delta_x.j2
triton_mm.j2
triton_template.py [inductor] fix could not find as_strided with config.triton.mm=triton (#88946) 2022-11-15 00:48:49 +00:00
triton.py Dont clone unmutated args in triton autotuning (#89519) 2022-11-23 22:00:03 +00:00
wrapper.py Have kernel names include fused ops (#88624) 2022-11-10 21:38:06 +00:00