This PR contains two fixes to improve IPC tensor release performance when using torchft's BabyProcessGroupNCCL:
1. Release the IpcMutex when deleting the `ExpandableSegments` object, to avoid synchronizing under the lock.
2. Release the GIL in the WorkNCCL destructor, since the shared tensor will be destructed there.
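The second fix can be pictured with a minimal sketch, assuming CPython's C API (this is not the actual WorkNCCL code): the destructor drops the GIL, if the current thread holds it, before tearing down state whose release may block on IPC/allocator locks.
```
// Minimal sketch, assuming CPython's C API; the real WorkNCCL destructor
// differs, but the idea is the same: don't hold the GIL across a release
// path that can block on IPC/allocator locks.
#include <Python.h>

struct WorkLike {
  ~WorkLike() {
    PyThreadState* saved = nullptr;
    if (PyGILState_Check()) {        // only drop the GIL if we actually hold it
      saved = PyEval_SaveThread();   // releases the GIL
    }
    // ... destruct shared/IPC tensors here; this may synchronize ...
    if (saved != nullptr) {
      PyEval_RestoreThread(saved);   // re-acquire the GIL before returning
    }
  }
};
```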
Test plan:
Run with torchft + torchtitan
```
REPLICA_GROUP_ID=0 NGPU=2 CUDA_VISIBLE_DEVICES=0,1 CONFIG_FILE=./torchtitan/models/llama/train_configs/llama3_8b.toml ./run_train.sh --training.data_parallel_shard_degree=2 --fault_tolerance.enable --fault_tolerance.group_size=2 --fault_tolerance.replica_id=0 --metrics.log_freq=1 --training.seq_len 4096
...
[rank0]:[titan] 2025-03-07 17:51:31,387 - root - INFO - step: 61 loss: 7.4825 memory: 79.73GiB(83.89%) tps: 317 tflops: 16.34 mfu: 1.65%
```
Check py-spy to verify there is no bottleneck on the IPC lock when creating new shared tensors.


Pull Request resolved: https://github.com/pytorch/pytorch/pull/148805
Approved by: https://github.com/Skylion007, https://github.com/fegin, https://github.com/zdevito
Summary:
LLVM has a warning `-Wunused-value` which we treat as an error because it's so often diagnostic of a code issue. Unused values often indicate a programming mistake, but can also just be unnecessary cruft that harms readability and performance.
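A tiny, generic illustration (not code from this diff) of the kind of statement the warning catches:
```
// Compile with: clang++ -Wunused-value unused_value.cpp
int main() {
  int x = 0;
  x + 1;          // warning: expression result unused [-Wunused-value]
  (void)(x + 1);  // casting to void documents an intentional discard
  return x;
}
```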
For questions/comments, contact r-barnes.
- If you approve of this diff, please use the "Accept & Ship" button :-)
Test Plan: Sandcastle
Differential Revision: D69945678
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147555
Approved by: https://github.com/Skylion007, https://github.com/eqy
We switch lintrunner to pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter for checking CUDA C++ sources. This also involves a Dockerfile change due to a missing libiomp installation, plus some other clang-tidy fixes triggered by the switch.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110502
Approved by: https://github.com/malfet
Part of #91395
Also modifies how `StorageImpl`s are stored in JIT static runtime's `MemoryPlanner`, which used to `std::move` `StorageImpl`s into a vector. Since `StorageImpl` can no longer be moved, `MemoryPlanner` now holds a malloc'd buffer into which new `StorageImpl`s are constructed with placement new.
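A minimal sketch of that pattern, using a simplified stand-in type rather than the real `StorageImpl`/`MemoryPlanner`: non-movable objects are constructed directly into a malloc'd buffer with placement new and destroyed manually before the buffer is freed.
```
// Simplified sketch of the described MemoryPlanner change, not the actual code.
#include <cstdlib>
#include <new>

struct Fixed {                 // stand-in for a non-movable StorageImpl
  explicit Fixed(int v) : value(v) {}
  Fixed(Fixed&&) = delete;     // cannot be moved into a std::vector
  int value;
};

int main() {
  const size_t n = 4;
  auto* buf = static_cast<Fixed*>(std::malloc(n * sizeof(Fixed)));
  for (size_t i = 0; i < n; ++i) {
    new (&buf[i]) Fixed(static_cast<int>(i));  // placement new, no move needed
  }
  // Objects in a raw buffer must be destroyed manually before freeing it.
  for (size_t i = 0; i < n; ++i) {
    buf[i].~Fixed();
  }
  std::free(buf);
  return 0;
}
```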
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93342
Approved by: https://github.com/ezyang
Summary:
This is part 1 of the effort to support `share_memory_()` in the C++ ATen library.
This allows C++ code to replace a tensor's storage in place with a shared-memory-based one.
For now, fd-based shm is the only supported implementation, to simplify memory management in general.
This first part intentionally avoids public API changes (to `TensorBase`, see comments in `StorageUtil.h`) so that we can get the core features usable outside pt/csrc first. Adding the API to `Tensor` or `TensorBase` would involve more distracting changes and make the change harder to review.
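For background on what "fd-based shm" means here, a plain POSIX sketch (this is not the new ATen helper or its API): a shared-memory object is created as a file descriptor, sized, and mapped; the fd can be passed to another process so both map the same pages.
```
// Plain POSIX illustration of fd-based shared memory; not ATen code.
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstring>
#include <cstdio>

int main() {
  const size_t bytes = 1024;
  // Create an fd-backed shared memory object; another process can open the
  // same name (or receive the fd) and map the same pages.
  int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
  if (fd < 0) { perror("shm_open"); return 1; }
  if (ftruncate(fd, bytes) != 0) { perror("ftruncate"); return 1; }
  void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) { perror("mmap"); return 1; }
  std::memcpy(p, "hello", 6);   // visible to any other process mapping the fd
  munmap(p, bytes);
  close(fd);
  shm_unlink("/demo_shm");
  return 0;
}
```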
Test Plan:
```
buck test caffe2:StorageUtils_test
```
Differential Revision: D43467616
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95228
Approved by: https://github.com/ezyang
See the discussion here for context: https://pytorch.slack.com/archives/GEEQ2K4MD/p1663672716533319?thread_ts=1662155536.133099&cid=GEEQ2K4MD. Opening a PR as suggested by @albanD.
Currently PyTorch holds the GIL when copying Tensors into shared memory. For certain workloads it would be nice to be able to copy different tensors into shared memory in parallel, but with the GIL being held the copies can't truly run in parallel.
Here's a short example of this:
```
import torch
import time
from multiprocessing.pool import ThreadPool

tensors = []
for i in range(64):
    for j in range(8):
        t = torch.ones(128, 480, 640).type(torch.uint8) * i
        tensors.append(t)
print("Done generating input tensors")

with ThreadPool(processes=8) as pool:
    futures = []
    before = time.time()
    for t in tensors:
        future = pool.apply_async(t.share_memory_)
        futures.append(future)
    for f in futures:
        f.get()
    elapsed = time.time() - before
    print("ELAPSED TIME", elapsed)
```
With this diff, I get:
```
~$ python repro.py
Done generating input tensors
ELAPSED TIME 3.561321258544922
~$
```
Previously, I would get:
```
~$ python repro.py
Done generating input tensors
ELAPSED TIME 16.305657386779785
~$
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/85389
Approved by: https://github.com/albanD