pytorch/torch/utils
Latest commit b8f4dc5a9f by Prachi Gupta: [ROCm] opportunistic fastatomics for ReduceAdd operations for MI300 GPUs (#146264)
In this approach, we catch any lanes within a wave that are doing fastatomics to the same destination address and compute their sum on the CU, so only a single atomic is issued per address. This yields a 3x improvement in scatter_add performance and a 2x improvement in index_select.

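Conceptually this is the wave-aggregated atomics pattern: lanes that target the same address elect a leader, pre-reduce their contributions with shuffles, and only the leader issues the hardware atomic. The sketch below is a minimal illustration of that pattern, not the PR's actual kernel code; it uses CUDA's `__match_any_sync`/`__shfl_sync` intrinsics and a 32-lane warp as stand-ins for the ROCm wavefront primitives, and the helper and kernel names are made up for the example.

```cuda
#include <cstdint>

// Illustrative device helper (hypothetical name, not from the PR): aggregate
// atomicAdd calls that hit the same destination address within one warp.
__device__ void opportunistic_fast_atomic_add(float* dst, float val) {
    unsigned active = __activemask();
    // Group the currently active lanes by destination pointer (sm_70+).
    unsigned peers  = __match_any_sync(active, reinterpret_cast<uintptr_t>(dst));
    int lane   = threadIdx.x % warpSize;
    int leader = __ffs(peers) - 1;            // lowest-numbered lane in the group

    // All lanes of a group share the same `peers` mask, so they execute the
    // same shuffle sequence; each lane accumulates every peer's contribution.
    float sum = 0.0f;
    for (unsigned m = peers; m != 0; m &= m - 1) {
        int src = __ffs(m) - 1;
        sum += __shfl_sync(peers, val, src);
    }

    // One atomic per distinct address instead of one per lane.
    if (lane == leader) {
        atomicAdd(dst, sum);
    }
}

// Example use in a scatter_add-style kernel: many threads add into out[index[i]].
__global__ void scatter_add_kernel(float* out, const long* index,
                                   const float* src, long n) {
    long i = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (i < n) {
        opportunistic_fast_atomic_add(&out[index[i]], src[i]);
    }
}
```
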
scatter_add performance on MI300X (median time reported by `triton.testing.do_bench`; lower is better):

dtype|Baseline (before optimizations)|Opportunistic fastatomics
-------|----------------------------------|----------------------------------
fp32|1.389425039|0.430447996
fp16|2.195472956|0.779729486
bf16|2.194051027|0.784599513

Measured using the following reproducer:
```python
import torch
import triton

def main():
    dtype = torch.float32
    dim = 1305301
    # Small (100-element) destination with ~1.3M scattered additions, so many
    # lanes hit the same destination addresses and the atomics collide heavily.
    a = torch.rand(100, device="cuda", dtype=dtype)
    index = torch.randint(0, 100, (dim,), device="cuda")
    src = torch.rand(dim, device="cuda", dtype=dtype)

    print("=" * 20)
    print(
        triton.testing.do_bench(
            lambda: a.scatter_add(0, index, src),
            return_mode="median",
        )
    )
    print("=" * 20)

if __name__ == "__main__":
    main()
```

co-authored by: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146264
Approved by: https://github.com/jeffdaily, https://github.com/mxz297

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-04-22 21:55:40 +00:00
| Name | Last commit message | Last commit date |
| --- | --- | --- |
| `_strobelight` | | |
| `_sympy` | Add ccode for CeilToInt and IntTrueDiv (#151375) | 2025-04-16 16:47:55 +00:00 |
| `backcompat` | | |
| `benchmark` | [BE][Ez]: Use itertools.chain.from_iterable when possible (#148190) | 2025-03-06 20:37:06 +00:00 |
| `bottleneck` | | |
| `data` | Optimize dataloader Self typing (#146816) | 2025-04-08 03:52:23 +00:00 |
| `hipify` | [fbgemm_gpu] Incorporate Torch DSA (#151148) | 2025-04-15 11:34:04 +00:00 |
| `jit` | | |
| `model_dump` | | |
| `serialization` | Make record/storage alignment in torch.save configurable (#147788) | 2025-03-06 12:04:46 +00:00 |
| `tensorboard` | Define __all__ for torch.utils.tensorboard (#147550) | 2025-02-28 23:06:11 +00:00 |
| `viz` | Fix ReferenceError: weakly-referenced object no longer exists in cycle detector (#146922) | 2025-02-24 22:27:39 +00:00 |
| `__init__.py` | | |
| `_appending_byte_serializer.py` | [MegaCache] Encode key in base64 (#151472) | 2025-04-17 17:12:22 +00:00 |
| `_backport_slots.py` | | |
| `_config_module.py` | Revert "[dynamo] context manager/decorator for dynamo config patching during tracing (#150586)" | 2025-04-16 16:13:47 +00:00 |
| `_config_typing.pyi` | | |
| `_content_store.py` | Revert "Use the device interface for detecting Triton availability (#139171)" | 2025-03-11 18:49:21 +00:00 |
| `_contextlib.py` | | |
| `_cpp_embed_headers.py` | | |
| `_cpp_extension_versioner.py` | | |
| `_cxx_pytree.py` | Gracefully handle optree less than minimum version, part 2 (#151257) | 2025-04-15 13:08:26 +00:00 |
| `_device.py` | Remove torch functions that do not support device arguments from _device_constructor (#150290) | 2025-04-08 15:13:55 +00:00 |
| `_exposed_in.py` | | |
| `_filelock.py` | | |
| `_foreach_utils.py` | [HPU] Add hpu to fused kernels supported devices (#148666) | 2025-03-07 04:28:33 +00:00 |
| `_freeze.py` | | |
| `_functools.py` | | |
| `_get_clean_triton.py` | Reland: [inductor] Simplify grid handling (#148305) | 2025-03-12 15:52:16 +00:00 |
| `_import_utils.py` | | |
| `_mode_utils.py` | | |
| `_ordered_set.py` | | |
| `_python_dispatch.py` | | |
| `_pytree.py` | Gracefully handle optree less than minimum version, part 2 (#151257) | 2025-04-15 13:08:26 +00:00 |
| `_stats.py` | | |
| `_thunk.py` | | |
| `_traceback.py` | | |
| `_triton.py` | [Inductor] Remove triton dtype patch which has landed (#149611) | 2025-04-10 03:42:55 +00:00 |
| `_typing_utils.py` | | |
| `_zip.py` | | |
| `backend_registration.py` | | |
| `bundled_inputs.py` | | |
| `checkpoint.py` | | |
| `collect_env.py` | collect_env: gracefully handle no pip (#151607) | 2025-04-18 12:28:58 +00:00 |
| `cpp_backtrace.py` | | |
| `cpp_extension.py` | [ROCm] opportunistic fastatomics for ReduceAdd operations for MI300 GPUs (#146264) | 2025-04-22 21:55:40 +00:00 |
| `deterministic.py` | | |
| `dlpack.py` | Add __all__ for torch.utils.dlpack (#149026) | 2025-04-11 22:03:24 +00:00 |
| `file_baton.py` | Warn user of existing lock file to avoid infinite waiting (#149382) | 2025-04-15 20:25:29 +00:00 |
| `flop_counter.py` | | |
| `hooks.py` | | |
| `mkldnn.py` | | |
| `mobile_optimizer.py` | | |
| `model_zoo.py` | | |
| `module_tracker.py` | | |
| `show_pickle.py` | | |
| `throughput_benchmark.py` | | |
| `weak.py` | | |