pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Fadi Botros	375ec25f55	Add missing aten::sort.any op for assistant lm models (#123982 ) Differential Revision: D56084098 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123982 Approved by: https://github.com/JacobSzwejbka	2024-04-23 01:35:07 +00:00
cyy	ea61c9cb29	[Distributed] [5/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124043 ) This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/124032. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124043 Approved by: https://github.com/ezyang	2024-04-23 00:43:50 +00:00
Ashwin Hari	5f5778476a	rename ort to maia (#123265 ) Fixes #123264 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123265 Approved by: https://github.com/albanD	2024-04-23 00:33:25 +00:00
Shuqiang Zhang	87a35d5a29	Use new function to log one cluster per line (#124628 ) Summary: For motivation behind the overall stack of diffs see D56218385 summary. This particular diff makes cpp_dumper take a custom printer function to log callstacks one group at a time, so it no longer runs into the 30K-character limit of `LOG(INFO)`. Test Plan: ``` [romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$ buck2 test //caffe2/torch/csrc/distributed/c10d/... File changed: fbcode//common/base/ThreadStackTrace.cpp File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/fb/TraceUtils.cpp File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp 4 additional file change events Buck UI: https://www.internalfb.com/buck2/d8ceae86-7d6f-4779-ad0c-8e37eddcff98 Network: Up: 0B Down: 0B Jobs completed: 2. Time elapsed: 1.5s. Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0 NO TESTS RAN [romanmal@46150.od /data/sandcastle/boxes/fbsource/fbcode (520a7b7b5)]$ ``` Tested to print the stack trace: P1220109730 Differential Revision: D56218360 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124628 Approved by: https://github.com/wconstab	2024-04-22 21:57:39 +00:00
Jeff Daily	6ede882c0b	preferred blas library; cublaslt gemm implementation (#122106 ) Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources. The default blas implementation remains cublas or hipblas. cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106 Approved by: https://github.com/lezcano	2024-04-22 15:38:22 +00:00
Chen, Zejun	b1984237a0	[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247 ) This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing. #suppress-api-compatibility-check Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247 Approved by: https://github.com/aaronenyeshi	2024-04-22 01:26:55 +00:00
Aaron Gokaslan	5a1216bb2e	[BE]: Update ruff to 0.4.1 (#124549 ) Update ruff to 0.4.1. This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes. Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0 \| Repository \| Linter (v0.3) \| Linter (v0.4) \| Formatter (v0.3) \| Formatter (v0.4) \| \|----------------------------------------------------\|---------------\|---------------\|------------------\|------------------\| \| [pytorch/pytorch](https://github.com/pytorch/pytorch) \| 328.7 \| 251.8 \| 351.1 \| 274.9 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549 Approved by: https://github.com/ezyang	2024-04-21 14:06:23 +00:00
Xiaodong Wang	57f64197f3	Reduce warning msg in torch.profiler (#124469 ) Summary: This is actually quite noisy and my logs are full of this soft assertion msg. Maybe making it log once? Test Plan: On AMD GPU side, I got a lot of those warnings: ``` W0415 01:40:45.109864 917160 collection.cpp:602] Warning: Memcpy ? (? -> ?) (function operator())” ``` So just suppress the excessive logs Reviewed By: aaronenyeshi, yoyoyocmu Differential Revision: D55602788 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124469 Approved by: https://github.com/aaronenyeshi	2024-04-20 04:45:12 +00:00
Scott Wolchok	3d8b903d95	[PyTorch] Remove ArrayRefTensor::numel_ (#124516 ) ArrayRefTensor::numel_ is redundant with the size of the contained MiniArrayRef. Reclaiming the space entirely would break ABI compatibility, but at least we have 4-8 bytes for future expansion. Differential Revision: [D56366829](https://our.internmc.facebook.com/intern/diff/D56366829/) NOTE FOR REVIEWERS: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D56366829/)! Pull Request resolved: https://github.com/pytorch/pytorch/pull/124516 Approved by: https://github.com/chenyang78, https://github.com/desertfire	2024-04-20 02:44:20 +00:00
PyTorch MergeBot	0feab7d6c3	Revert "Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611 )" This reverts commit `cb17721899`. Reverted https://github.com/pytorch/pytorch/pull/123611 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))	2024-04-19 22:44:26 +00:00
PyTorch MergeBot	929242a15c	Revert "torch.mtia module for MTIA device backend (#123612 )" This reverts commit `d7e1bf9ff9`. Reverted https://github.com/pytorch/pytorch/pull/123612 on behalf of https://github.com/jeffdaily due to This broke ROCm. see test_overrides.py ([comment](https://github.com/pytorch/pytorch/pull/123611#issuecomment-2067363780))	2024-04-19 22:44:26 +00:00
PyTorch MergeBot	f87c788a34	Revert "Capture triton kernel in execution trace (#124140 )" This reverts commit `89407eca3b`. Reverted https://github.com/pytorch/pytorch/pull/124140 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/124140#issuecomment-2067137104))	2024-04-19 19:05:44 +00:00
ydwu4	e62169a8fa	Support torchbind op dispatch in python (#123367 ) We override the `__call__` method and register fake, functional, proxy default dispatch mode implementation in its python_key_mode_table. The idea is: 1. when inputs contain FakeScriptObject, we dispatch through the _get_dispatch mechanism. We implement dispatch mode keys automatically in the operator's constructor. 2. when inputs are not fakified, we dispatch through the original c++ dispatcher. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123367 Approved by: https://github.com/zou3519	2024-04-19 17:17:27 +00:00
Cen Zhao	96724a769b	[ptd] drop ncclGroupStart/end for ncclCommInit (#124363 ) (#124416 ) Summary: ``` ncclGroupStart() ncclCommInit(..) ncclGroupEnd() ``` The above pattern is only needed when a single thread manages multiple GPUs. In our case, we always have 1 process managing 1 GPU, so we don't need the group operation. Test Plan: CI Differential Revision: D56274975 Co-authored-by: Cen Zhao <cenzhao@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/124416 Approved by: https://github.com/shuqiangzhang	2024-04-19 13:12:42 +00:00
PyTorch MergeBot	520bc1080e	Revert "[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247 )" This reverts commit `768ce2cdda`. Reverted https://github.com/pytorch/pytorch/pull/123247 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/123247#issuecomment-2066152611))	2024-04-19 09:09:03 +00:00
Nikita Shulga	1ba85b34dd	[AOTI] Enable mmaped weights when CUDA is used (#124346 ) This is done by refactoring the logic that returns the start-of-constants pointer into a `_get_constants_start()` method and calling it from both the CUDA and CPU readers. It has no runtime impact, but export time is down from 10m to 3m if mmaped weights are used on AWS p4d.24xlarge. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124346 Approved by: https://github.com/mikekgfb, https://github.com/desertfire	2024-04-19 04:47:27 +00:00
Chen, Zejun	768ce2cdda	[Profiler] Unify the device(CUDA, XPU, PrivateUse1) in torch profiler post processing (#123247 ) This PR unifies the CUDA, XPU and PrivateUse1 in the torch profiler. Now CUDA, XPU and PrivateUse1 can together use string object `use_device` to distinguish each other and share one device path for calculating kineto time durations and memory statistics for post processing. #suppress-api-compatibility-check Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123247 Approved by: https://github.com/aaronenyeshi, https://github.com/gujinghui	2024-04-19 03:31:13 +00:00
rzou	889e3eeed3	Avoid cuda init to FakeTensorMode (#124413 ) Also partially fixes #122109 This PR: - We add a C++ flag (only_lift_cpu_tensors) to toggle the torch.tensor(1, device='cuda') ctor strategy. When false (default), it does the current PyTorch behavior of unconditionally constructing a concrete CUDA tensor then calling lift_fresh on it. When true, we instead construct a concrete CPU tensor, call lift_fresh, and then call Tensor.to(device) (under any ambient modes). - FakeTensorMode flips this flag depending on if CUDA is available or not. We don't unconditionally set the flag to True because that is likely BC-breaking. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124413 Approved by: https://github.com/eellison	2024-04-19 02:39:35 +00:00
Tristan Rice	ddd0ed1b43	distributed: templated ring attention (#124215 ) This adds a templated version of the ring attention forwards function as well as tests it with memory efficient attention. This doesn't add support for memory efficient attention in DTensor. That will be added in a follow up PR. This templating is also a POC of how to support other attention ops such as Jagged/nested tensor and as well how to implement striped attention in a scalable way. Misc changes: * Fixes all_to_all_single autograd implementation with CUDA + adds NCCL test * Adds compile support to the ring attention implementations (required some tweaks to process groups) Test plan: ``` pytest test/distributed/_tensor/test_attention.py pytest test/distributed/test_functional_api.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/124215 Approved by: https://github.com/wanchaol	2024-04-19 00:57:08 +00:00
Shuqiang Zhang	ca6a0e1348	[c10d] remove the env of TORCH_NCCL_ABORT_IN_DESTROY_PG (#124334 ) Summary: This ENV was introduced to safely roll out the behavior change in destroy process group (e.g., call ncclCommsAbort). Now that this behavior change has already been rolled out, we no longer need this env and should remove it to keep our code cleaner. Test Plan: Modified/existing ut pass Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/124334 Approved by: https://github.com/wconstab	2024-04-18 23:42:55 +00:00
Sheng Fu	89407eca3b	Capture triton kernel in execution trace (#124140 ) Summary: This DIFF is to capture triton kernels in execution trace. Test Plan: buck test mode/dev-nosan caffe2/test:profiler -- test_execution_trace_with_pt2 Differential Revision: D56162599 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124140 Approved by: https://github.com/briancoutinho	2024-04-18 18:38:26 +00:00
egienvalue	d7e1bf9ff9	torch.mtia module for MTIA device backend (#123612 ) MTIA device has its own Module in PyTorch now. torch.mtia has the following APIs, similar to other backends. The lazy_init is also supported. ``` __all__ = [ "init", "is_available", "synchronize", "device_count", "current_device", "current_stream", "default_stream", "set_stream", "stream", "device", ] ``` ------------ For device management, we expand AcceleratorHooksInterface to support generic device management so that it can be used in both C++ and Python. ``` def _accelerator_hooks_device_count() -> _int: ... def _accelerator_hooks_set_current_device(device_index: _int) -> None: ... def _accelerator_hooks_get_current_device() -> _int : ... def _accelerator_hooks_exchange_device(device_index: _int) -> _int : ... def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int : ... ``` --------- Adding get_device_module API to retrieve device modules for different device types. ``` def get_device_module(device: Optional[Union[torch.device, str]] = None) ``` --------- @exported-using-ghexport Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612 Approved by: https://github.com/albanD ghstack dependencies: #123611	2024-04-18 17:38:06 +00:00
egienvalue	cb17721899	Build device generic torch.Stream and torch.Event based on c10::Stream/Event (#123611 ) This diff intends to build device generic torch.Stream and torch.Event for newly added accelerators in PyTorch. ------------ torch.Stream APIs ``` # Defined in torch/csrc/Stream.cpp class Stream(_StreamBase): stream_id: _int # Stream id device_index: _int device_type: _int device: _device # The device of the stream @overload def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ... @overload def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ... def query(self) -> _bool: ... def synchronize(self) -> None: ... def wait_event(self, event: Event) -> None: ... def wait_stream(self, other: Stream) -> None: ... def record_event(self, event: Optional[Event] = None) -> Event: ... def query(self) -> None: ... def synchronize(self) -> None: ... def __hash__(self) -> _int: ... def __repr__(self) -> str: ... def __eq__(self, other: object) -> _bool: ... ``` ------------------ torch.Event APIs: - IPC related APIs are not implemented, since many device backends don't support it, but we leave interfaces there for future adaption of torch.cuda.Stream. - currently only the enable_timing is supported, since it is the most common one used in other device backends. We have to refactor the event flag system in PyTorch to support more fancy flag. - elapsedTime API is added to c10::Event ``` # Defined in torch/csrc/Event.cpp class Event(_EventBase): device: _device # The device of the Event event_id: _int # The raw event created by device backend def __new__(self, device: Optional[DeviceLikeType] = None, enable_timing: _bool = False, blocking: _bool = False, interprocess: _bool = False) -> Event: ... @classmethod def from_ipc_handle(self, device: DeviceLikeType, ipc_handle: bytes) -> Event: ... def record(self, stream: Optional[Stream] = None) -> None: ... 
def wait(self, stream: Optional[Stream] = None) -> None: ... def query(self) -> _bool: ... def elapsed_time(self, other: Event) -> _float: ... def synchronize(self) -> None: ... def ipc_handle(self) -> bytes: ... def __repr__(self) -> str: ... ``` ----------- c10::Event provides new APIs - calculate elapsedTime. - Get raw event id - Synchronize event. ``` double elapsedTime(const Event& event) const { return impl_.elapsedTime(event.impl_); } void* eventId() const { return impl_.eventId(); } void synchronize() const { return impl_.synchronize(); } ``` ---------- TODO: need to find a good way to test them in PyTorch with API mocks. Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611 Approved by: https://github.com/albanD	2024-04-18 17:35:09 +00:00
rzou	648c39c47d	Add OpOverload.redispatch; use it in new custom ops API (#124089 ) A kernel has "dispatcher convention" if there is an additional keyset arg at the beginning of the argument list. This PR: - adds a way to register kernels with dispatcher_convention using Library.impl (pass dispatcher_convention = True) - adds OpOverload.redispatch We use both of the above in the new custom ops API: we register the autograd kernel in dispatcher convention so that we can actually call redispatch like how pytorch built-in ops do it. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089 Approved by: https://github.com/albanD ghstack dependencies: #123937, #124064, #124065, #124066, #124071	2024-04-18 12:48:04 +00:00
Animesh Jain	f213f262af	[dynamo][cpp-guards] Improve when to use Dict vs DictSubclassGuardManager (#124237 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124237 Approved by: https://github.com/jansel, https://github.com/mlazos ghstack dependencies: #124230	2024-04-18 03:33:37 +00:00
William Wen	812bae09be	[dynamo] fix 3.11+ refleak (#124238 ) Fixes https://github.com/pytorch/pytorch/issues/119607 for 3.11+. In 3.11+, `_PyFrame_FastToLocalsWithError` could implicitly run `COPY_FREE_VARS` on the original frame, leading to double incref's since the dynamo shadow frame can rerun `COPY_FREE_VARS`. So the solution is to skip the first `COPY_FREE_VARS` instruction in the shadow frame if it was already executed in the original frame. Also move the location for clearing the original frame in 3.12 to handle error cases more thoroughly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124238 Approved by: https://github.com/jansel	2024-04-18 03:02:29 +00:00
Tristan Rice	1ec05c769b	all_gather and reduce_scatter autograd (#123989 ) This adds `all_gather_tensor_autograd` and `reduce_scatter_tensor_autograd` to the functional_collectives library. This only supports `sum` mode for `reduce_scatter` but should be easy to extend in the future. The backwards implementations match the behavior in https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py This follows the pattern of #123599 . Test plan: ```sh pytest test/distributed/test_functional_api.py -k Autograd ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123989 Approved by: https://github.com/wanchaol	2024-04-17 21:32:22 +00:00
PyTorch MergeBot	36f6928a37	Revert "[Profiler][PrivateUse1] Profiler support PrivateUse1 key (#120556 )" This reverts commit `41613a0803`. Reverted https://github.com/pytorch/pytorch/pull/120556 on behalf of https://github.com/aaronenyeshi due to Breaks GPU Chrome trace UI ([comment](https://github.com/pytorch/pytorch/pull/120556#issuecomment-2061578951))	2024-04-17 15:38:14 +00:00
rzou	47dbfecd37	Rename impl_abstract to register_fake, part 1/2 (#123937 ) This PR: - adds a new torch.library.register_fake and deprecates torch.library.impl_abstract. The motivation is that we have a lot of confusion around the naming so we are going to align the naming with the actual subsystem (FakeTensor). - renames `m.impl_abstract_pystub("fbgemm_gpu.sparse_ops")` to `m.has_python_registration("fbgemm_gpu.sparse_ops")`. No deprecation here yet; I need to test how this works with static initialization. - Renames a bunch of internals to match (e.g. abstractimplpystub -> pystub) I'm scared to rename the Python-side internal APIs (e.g. torch._library.abstract_impl) because of torch.package concerns. I'll do that in its own isolated PR next just in case it causes problems. DEPRECATION NOTE: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use register_fake. We'll delete impl_abstract in a future version of PyTorch. Test Plan: - existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123937 Approved by: https://github.com/albanD	2024-04-17 12:46:01 +00:00
Animesh Jain	51cc808ac7	[dynamo][cpp-guards] Missing decref on early returns in DictSubclassGuardManager (#124230 ) I am sad that I missed this earlier. Good thing is that CI caught it. Will be more careful next time. This was the reason https://github.com/pytorch/pytorch/pull/123547 is reverted - https://github.com/pytorch/pytorch/pull/123547#issuecomment-2058350245 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124230 Approved by: https://github.com/mlazos	2024-04-17 02:49:07 +00:00
Shengbao Zheng	42e22bb444	[nccl-pg] Pass pg name and desc to NCCL communicator (#124149 ) Summary: Pass Process Group Name and Desc to NCCL communicator in order to access pg information in NCCL layer. The information is passed as commDesc string(i.e. "<pg_desc>:<pg_name>") Function only valid when NCCL_COMM_DESCRIPTION is defined. Differential Revision: D55703310 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124149 Approved by: https://github.com/shuqiangzhang	2024-04-16 20:08:07 +00:00
cyy	c2596fd3e0	[Distributed] [4/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#124032 ) This PR continues to fix some clang-tidy warnings in distributed/c10d code, following https://github.com/pytorch/pytorch/pull/123312. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124032 Approved by: https://github.com/Skylion007	2024-04-16 00:42:18 +00:00
Shivam Raikundalia	9079c76689	Fix Asynchronous PyTorch Profiler Trace (#124080 ) Summary: With the merge of D55925068, we have introduced an overflow issue when recording a trace using dyno gputrace. This is because it is possible for TorchOPs to be enumerated but not have an end time since they were running as the recording ended. By default these events have an end time set to INT_MIN. When finding the duration() for such events using end-start, we get an overflow resulting in a very long duration. This was avoided before because we were dividing the INT_MIN by 1000 because we were trying to convert uS to nS. This change introduces a patch for TorchOps and a future PR will be added to create a more universal guard in kineto. Test Plan: Trace recorded using resnet test. Trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/0/1713199267/localhost/libkineto_activities_2247224.json.gz&bucket=gpu_traces Differential Revision: D56144914 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124080 Approved by: https://github.com/aaronenyeshi	2024-04-16 00:24:32 +00:00
Shengbao Zheng	9fa922c2ed	[profiler] Log process group name instead of pg uid (#124035 ) Summary: As part of the work of unifying process group identifier, log <group_name, group_desc>, instead of pg uid in profiler. - group_name remains as the unique identifier, e.g. "0", "1" - group_desc will be the user specified name, e.g. "fsdp". Reviewed By: aaronenyeshi, kwen2501 Differential Revision: D55610682 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124035 Approved by: https://github.com/aaronenyeshi	2024-04-15 21:49:06 +00:00
PHLens	9aba918bd8	Support Accelerator OOM Error (#121200 ) (#121702 ) Fixes #121200 This PR introduces AcceleratorOutOfMemoryError for all privateuse1 backend. For python, there is a PyError object which will be set only when privateuse1 is registered. All privateuse1 backend then can use this error for memory errors. Maybe more error types in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121702 Approved by: https://github.com/guangyey, https://github.com/albanD	2024-04-15 21:41:46 +00:00
cyy	b60af92c17	[Distributed] [3/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#123312 ) This PR continues to fix some clang-tidy warnings in distributed code, following https://github.com/pytorch/pytorch/pull/122892. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123312 Approved by: https://github.com/Skylion007	2024-04-13 11:45:00 +00:00
cyy	77a45883ce	[Reland] [Distributed] [2/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#123821 ) Reland of #122892 with problematic changes reverted. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123821 Approved by: https://github.com/Skylion007	2024-04-13 00:57:03 +00:00
Shengbao Zheng	585cd117e6	[nccl-pg] print broadcast ncclunique id duration (#123963 ) Summary: Print NCCL PG broadcast nccl unique id duration for measurement. Differential Revision: D56048059 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123963 Approved by: https://github.com/wconstab	2024-04-12 23:33:11 +00:00
Yukio Siraichi	e4c887fbf6	[AOTAutograd] Replay views on output using `FunctionalTensor` metas. (#121007 ) Fix: #120336 This PR fixes an issue on AOTAutograd, specifically on backends that don't support views by themselves (e.g. XLA). Previously, AOTAutograd tried to reconstruct output views by calling `as_strided` on the concrete bases using sizes and strides of the outputs that aliased them. Since backends such as XLA don't support tensor aliasing, the sizes and strides would be those of a contiguous tensor (not a view tensor). Because of that, calling `as_strided` would error, since the output tensor would be bigger than its base. Instead, this PR applies the sequence of `ViewMeta` gathered for each output during the functionalization phase. Note: we intentionally don't support base tensors that went through metadata mutation, i.e. in-place view operations. In summary, this PR: - Introduces one `FunctionalTensorWrapper` member function alongside its Python APIs - `apply_view_metas(base)`: applies the `ViewMeta` sequence of the given instance onto another base - Introduces an `OutputAliasInfo.functional_tensor` field - Saves the `FunctionalTensorWrapper` instance collected by the functionalization phase - Wraps it with a new `FunctionalTensorMetadataEq` class for comparing only the metadata of the tensors - Plumbs `OutputAliasInfo.functional_tensor` to `gen_alias_from_base` function - Applies the `ViewMeta` sequence of the saved `FunctionalTensor` onto `aliased_base_tensor` - Propagates `OutputAliasInfo.functional_tensor` when updating `fw_metadata` (this PR description was updated in order to reflect the most recent changes) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121007 Approved by: https://github.com/bdhirsh	2024-04-12 16:54:13 +00:00
Florian	41613a0803	[Profiler][PrivateUse1] Profiler support PrivateUse1 key (#120556 ) Summary: 1.Package public headers of kineto if USE_KINETO so that they can be used by PrivateUse1 user. 2.Add PrivateUse1 key to ActivityType. 3. Support PrivateUse1 key in function deviceTypeFromActivity and _supported_activities. 4. Fix some bugs when processing profiler results. Co-authored-by: albanD <desmaison.alban@gmail.com> Co-authored-by: Aaron Shi <enye.shi@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120556 Approved by: https://github.com/aaronenyeshi	2024-04-12 14:28:19 +00:00
Shengbao Zheng	4e9094533e	[c10d/nccl-pg] allow user to pass process group description (#123472 ) Summary: We need a way to allow users to set a customized description for a process group, e.g. FSDP, PP. Here are several use cases of user specified group_desc: - Logging: we can easily match a log line and understand what this collective/pg is used for. - Pytorch traces (e.g. Kineto, Execution Trace) can benefit from the PG desc since trace analysis and benchmarks will be able to easily differentiate PG purpose like FSDP, PP. - Lower layer collectives (e.g. NCCL) debug: we will be able to expose PG desc to NCCL communicator so NCCL layer operations can be easily correlated to a PG. Solution: Add a group_desc field to c10d Differential Revision: D55781850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123472 Approved by: https://github.com/kwen2501	2024-04-12 08:44:21 +00:00
Tristan Rice	358ace1a1b	functional_collectives: add first differentiable collective -- all_to_all_single_grad (#123599 ) This adds the differentiable collective -- all_to_all_single_grad. This is the initial proof of concept PR and I will be adding the remaining collectives in follow up PRs. This adds a new function called `all_to_all_single_autograd` which is the autograd variant of `all_to_all_single`. For backwards compatibility + initial testing we wanted to make the autograd variant separate to avoid regressions. This uses `autograd::Function` to register an Autograd op that calls the original `_c10d_functional::all_to_all_single` via the dispatcher. This works with compile and inductor as opposed to the previous Python implementation that had issues. As this uses the existing `_c10d_functional` ops we don't need to register any meta functions or lowering. To avoid cudaStream issues this explicitly calls `wait_tensor` in the backward method to ensure it runs under the same stream as the async operation. This hurts performance but can be alleviated potentially using `compile`. Related work: https://github.com/pytorch/torchrec/blob/main/torchrec/distributed/comm_ops.py Test plan: ``` pytest test/distributed/test_functional_api.py -k test_all_to_all_single_compile pytest test/distributed/test_functional_api.py ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/123599 Approved by: https://github.com/yifuwang	2024-04-12 01:48:49 +00:00
Brian Hirsh	2fe672b146	compile: ban mutations on non-compositional uses of as_strided (#122502 ) Fixes https://github.com/pytorch/pytorch/issues/104505 I was originally going to ban all usages of as_strided + mutation in functionalization. But I'm pretty sure that as_strided + mutation is fine when we are calling as_strided on a base tensor. So in this PR I added a slightly more conservative check: if we see an as_strided + mutation, where the input to an as_strided was another view op, then I error loudly in functionalization and link to the github issue above (in case anyone runs into this in the real world) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122502 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-04-12 01:12:23 +00:00
Shuqiang Zhang	22ba180e55	[c10d] add more fields for periodic logging (#123860 ) Summary: Added the names of the last enqueued, started and completed collectives, in addition to their seq ID Test Plan: CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/123860 Approved by: https://github.com/XilunWu	2024-04-12 00:11:07 +00:00
Animesh Jain	2e6871f924	[dynamo][guards-cpp] Early return in DictGuardManager for empty dicts (#123787 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123787 Approved by: https://github.com/jansel ghstack dependencies: #123773	2024-04-11 22:23:28 +00:00
Animesh Jain	b0b7aa201c	[dynamo][cpp-guards] Introduce DictSubclassGuardManager (#123773 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123773 Approved by: https://github.com/jansel	2024-04-11 22:23:28 +00:00
David Berard	6b24ec480c	[Tensor] Detect more cases of symbolic sizes/strides (#123696 ) Previously, we'd just check `has_symbolic_sizes_strides()` to know whether a tensor has symbolic sizes or strides; if it does, we skip some profiler logic. But sometimes `has_symbolic_sizes_strides()` returns false even though we do actually have symbolic sizes or strides. So in this change, we add `may_have_symbolic_sizes_strides()`, which should never return false if the tensor has symbolic sizes and strides. Why not change `has_symbolic_sizes_strides()`? It seems like there's preexisting logic that assumes that "if has_symbolic_sizes_strides(), then we can assume that this tensor is guaranteed to have symbolic sizes or strides". In this case, we have python-implemented sizes or strides, which should follow a different code path. Differential Revision: [D55947660](https://our.internmc.facebook.com/intern/diff/D55947660/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123696 Approved by: https://github.com/aaronenyeshi, https://github.com/soulitzer	2024-04-11 16:51:52 +00:00
Shuqiang Zhang	e00282fecf	[c10d] make monitorThread sleep when we try to dump (#123788 ) Summary: We separated the FR dump logic from the desync debug logic, so we no longer set collectiveDebugInfoMode_ to true when we just need FR dump. That's why the monitor thread did not sleep and tried to kill the process without waiting for the dump. The fix is simple: we should sleep whenever shouldDump_ is true. Test Plan: Existing unit tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/123788 Approved by: https://github.com/wconstab	2024-04-11 07:10:46 +00:00
Nikita Shulga	416f532753	[AOTI] Serialize large weights (#123002 ) By appending them to the end of the shared library and mmapping afterwards. Disabled by default, but overridable by `config.aot_inductor.force_mmap_weights`. Implemented by adding `USE_MMAP_SELF` define to `inductor/aoti_runtime/model.h`, which is defined when weights are appended to the binary. In that case, the shared library name is determined by calling `dladdr`, mmapped, and finally checked against a random magic number embedded at the end of the weights as well as in the const section of the library in question. Added unit tests to validate that it works as expected. TODO: - Extend support to CUDA - munmap region if the same library is reused Pull Request resolved: https://github.com/pytorch/pytorch/pull/123002 Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/mikekgfb	2024-04-11 06:39:58 +00:00
Shivam Raikundalia	3ebbeb75fd	[Profiler] Make Kineto traces export ns granularity for finer timestamps (#122425 ) (#123650 ) Summary: Kineto traces use microsecond level granularity because Chrome tracing defaults to that precision. Fix by adding preprocessor flag to TARGETS and BUCK files. Also remove any unnecessary ns to us conversions made in the profiler itself. This diff contains profiler changes only. Libkineto changes found in D54964435. Test Plan: Check JSON and chrome tracing to make sure values are as expected. Tracing with flags enabled should have ns precision. Tracings without flags should be same as master. Zoomer: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=796886748550189 Ran key_averages() to make sure FunctionEvent code works as expected:
Name                                  Self CPU %       Self CPU    CPU total %      CPU total   CPU time avg      Self CUDA    Self CUDA %     CUDA total  CUDA time avg     # of Calls
ProfilerStep*                              0.74%        3.976ms         64.40%      346.613ms       69.323ms        0.000us          0.00%       61.710ms       12.342ms              5
Optimizer.zero_grad#SGD.zero_grad          0.76%        4.109ms          0.76%        4.109ms      821.743us        0.000us          0.00%        0.000us        0.000us              5
## forward ##                              6.89%       37.057ms         27.19%      146.320ms       29.264ms        0.000us          0.00%       58.708ms       11.742ms              5
aten::conv2d                               0.22%        1.176ms          7.74%       41.658ms      157.199us        0.000us          0.00%       27.550ms      103.962us            265
aten::convolution                          0.79%        4.273ms          7.52%       40.482ms      152.762us        0.000us          0.00%       27.550ms      103.962us            265
aten::_convolution                         0.69%        3.688ms          6.73%       36.209ms      136.637us        0.000us          0.00%       27.550ms      103.962us            265
aten::cudnn_convolution                    6.04%       32.520ms          6.04%       32.520ms      122.719us       27.550ms          8.44%       27.550ms      103.962us            265
aten::add_                                 2.42%       13.045ms          2.42%       13.045ms       30.694us       12.700ms          3.89%       12.700ms       29.882us            425
aten::batch_norm                           0.19%        1.027ms          8.12%       43.717ms      164.971us        0.000us          0.00%       16.744ms       63.185us            265
aten::_batch_norm_impl_index               0.31%        1.646ms          7.93%       42.691ms      161.096us        0.000us          0.00%       16.744ms       63.185us            265
Differential Revision: D55925068 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123650 Approved by: https://github.com/aaronenyeshi	2024-04-11 04:29:20 +00:00

13670 Commits