pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Frank Lin	249e65b92d	Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 ) See #113541 The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality. cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068 Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/xuzhao9	2024-03-27 01:14:38 +00:00
Edward Z. Yang	85845a29db	Refactor ShapeEnvSettings so it's directly on ShapeEnv (#122310 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122310 Approved by: https://github.com/masnesral, https://github.com/lezcano	2024-03-26 14:16:33 +00:00
PyTorch MergeBot	4dc09d6aa4	Revert "Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 )" This reverts commit `e9dcda5cba`. Reverted https://github.com/pytorch/pytorch/pull/114068 on behalf of https://github.com/ezyang due to memory leak in another ci ([comment](https://github.com/pytorch/pytorch/pull/114068#issuecomment-2018044527))	2024-03-25 13:49:04 +00:00
Edward Z. Yang	476585b190	Preserve unbacked SymInt on SymNode (#120816 ) Previously, when we applied a replacement, a SymInt that was previously an unbacked SymInt would then transmute into whatever we replaced it into (e.g., a constant). This has a major downside: we often look at SymInts associated with FX nodes (e.g., the meta of x.item() return) to find out where the unbacked SymInt was allocated. If we replace it, we no longer can find out where, e.g., u1 was allocated! But we need to know this so we can generate deferred runtime asserts like u1 == s0. To solve this problem, I have a special mode for replace, resolve_unbacked=False, which lets you disable substitutions on unbacked SymInts. When reporting node.expr, we preferentially avoid applying unbacked SymInt substitutions. To understand if we might accidentally reapply the substitution later, before we have reached the deferred runtime assert, we must study the calls to simplify() in ShapeEnv. My audit turns up these sites: * `produce_guards`: this is fine, deferred runtime asserts never show up here, we must NOT have unbacked SymInts show up here. Similarly `get_nontrivial_guards`. * `_maybe_evaluate_static`: this is fine, we are using this to determine if it is necessary to produce a guard/runtime assert. We don't want to reissue a runtime assert if we've already asserted on it, and replacements can help us understand if this has occurred. * `_simplify_floor_div`: this is a legitimate bug, it needs to be `resolve_unbacked=False` * `_refine_ranges`: this is fine, a refined range doesn't affect what runtime asserts we issue * `_update_divisible`: this updates the `self.divisible` set, which specifies when we can simplify away divisibility constraints. Since this affects replacements only, it won't cause us to oversimplify a user provided expression. There are some situations where we DO want to always apply the substitution, specifically when we have the duplicate symbol problem (we retrace an item call and get u0 and u1 which refer to the same thing.) I don't want two symbols in this case, so a special `rename_unbacked_to` is provided which sets up the unconditional renaming. Along the way, I make a refinement to `_update_var_to_range`: if you update a var range for a size-like unbacked SymInt, you are now no longer allowed to set its lower bound below 2. This is because if you could, then our size oblivious tests for it would be inconsistent. Actually, I think there is still some inconsistency, because if you assert `u0 == 0` we will still end up with this in deferred runtime asserts, and we will then use this to simplify these statements to be True everywhere else. Maybe we should forbid this kind of refinement; not done in this PR. Fixes https://github.com/pytorch/pytorch/issues/119689 Fixes https://github.com/pytorch/pytorch/issues/118385 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/120816 Approved by: https://github.com/lezcano	2024-03-24 02:56:16 +00:00
liqunfu	bbe846f430	Add symbolic_opset19.py and symbolic_opset20.py to support opset 19/20, extend opset 18 support (#118828 ) Start to fix https://github.com/pytorch/pytorch/issues/114801 Co-authored-by: Thiago Crepaldi <thiagofc@microsoft.com> Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118828 Approved by: https://github.com/thiagocrepaldi	2024-03-22 18:01:33 +00:00
Sahdev Zala	17175cdbc7	[Docs] Add extended debugging options for troubleshooting (#122028 ) Fixes #120889 Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122028 Approved by: https://github.com/ezyang, https://github.com/malfet	2024-03-21 17:00:45 +00:00
Frank Lin	e9dcda5cba	Graph-Safe RNG State Exchange for Tensor Parallelism (#114068 ) See #113541 The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality. cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068 Approved by: https://github.com/ezyang	2024-03-21 01:57:08 +00:00
Nathan	ae983d2d6e	Fix typo in sparse.rst (#121826 ) Change word "on" to "one" when talking in the third person. Fixes #121770 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121826 Approved by: https://github.com/janeyx99	2024-03-19 00:17:19 +00:00
Jane Xu	37e563276b	Document complex optimizer semantic behavior (#121667 ) <img width="817" alt="image" src="https://github.com/pytorch/pytorch/assets/31798555/565b389d-3e86-4767-9fcb-fe075b50aefe"> Pull Request resolved: https://github.com/pytorch/pytorch/pull/121667 Approved by: https://github.com/albanD	2024-03-16 00:43:47 +00:00
Tugsbayasgalan Manlaibaatar	53d2188df9	Update get_aten_graph_module (#121937 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121937 Approved by: https://github.com/andrewor14	2024-03-15 20:35:55 +00:00
Aidyn-A	af86d67d61	[Doc][NVTX] Add documentation for nvtx.range (#121699 ) The context manager `torch.cuda.nvtx.range` has been around for about 4 years (see #42925). Unfortunately, it was never documented and as a consequence users are just unaware of it (see #121663). Pull Request resolved: https://github.com/pytorch/pytorch/pull/121699 Approved by: https://github.com/janeyx99	2024-03-15 20:26:44 +00:00
lezcano	d0d09f5977	Fix torch.compile links (#121824 ) Fixes https://github.com/pytorch/pytorch.github.io/issues/1567 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121824 Approved by: https://github.com/svekars, https://github.com/peterbell10, https://github.com/malfet ghstack dependencies: #121823	2024-03-15 19:49:37 +00:00
Matthias Reso	a9274c9a2c	Fix aoti doc to avoid cannot bind non-const lvalue reference error (#121672 ) This PR corrects the example in the AOTInductor example which currently fails with: ``` /home/ubuntu/test/inference.cpp:21:62: error: cannot bind non-const lvalue reference of type ‘std::vector<at::Tensor>&’ to an rvalue of type ‘std::vector<at::Tensor>’ 21 \| std::cout << runner.run({torch::randn({2, 10}, at::kCPU)})[0] << std::endl; \| ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121672 Approved by: https://github.com/desertfire	2024-03-12 23:43:40 +00:00
PyTorch MergeBot	0398dc9e8e	Revert "[DCP] Makes fsspec public (#121508 )" This reverts commit `d482614fec`. Reverted https://github.com/pytorch/pytorch/pull/121508 on behalf of https://github.com/osalpekar due to this causes torchrec tests to fail internally with this error: ModuleNotFoundError: No module named 'fsspec'. see [D54779117](https://www.internalfb.com/diff/D54779117) ([comment](https://github.com/pytorch/pytorch/pull/121508#issuecomment-1992137831))	2024-03-12 17:02:43 +00:00
Lucas Pasqualin	d482614fec	[DCP] Makes fsspec public (#121508 ) Fixes #118033 Also removes `_checkpointer.py` class original PR's: - https://github.com/pytorch/pytorch/pull/121330 - https://github.com/pytorch/pytorch/pull/121329 We're also disabling `test_fsdp` since it is failing on random PR's Pull Request resolved: https://github.com/pytorch/pytorch/pull/121508 Approved by: https://github.com/fegin	2024-03-09 01:14:18 +00:00
chilli	ed8eebd1c2	Changed cublas repdocubility URL (#121534 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121534 Approved by: https://github.com/Skylion007	2024-03-08 23:46:21 +00:00
angelayi	f2d5e96db4	[export] Add docs for 2.3 release (#121466 ) - Added docs about non-strict export - Added example using derived dims - Added api docs for ep.run_decompositions() (https://github.com/pytorch/pytorch/issues/119480) - Tried to include/cover everything in https://docs.google.com/document/d/1kZ_BbB3JnoLbUZleDT6635dHs88ZVYId8jT-yTFgf3A/edit Pull Request resolved: https://github.com/pytorch/pytorch/pull/121466 Approved by: https://github.com/zhxchen17	2024-03-08 22:29:48 +00:00
Ke Wen	c78f72d7e7	[c10d] Deprecate torch.distributed.pipeline (#121464 ) In favor of PiPPy (Pipeline Parallelism for PyTorch) https://github.com/pytorch/PiPPy Pull Request resolved: https://github.com/pytorch/pytorch/pull/121464 Approved by: https://github.com/wz337, https://github.com/awgu	2024-03-08 19:55:02 +00:00
Wanchao Liang	30982ce072	[tp] doc fixes (#121431 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121431 Approved by: https://github.com/wz337	2024-03-08 17:46:44 +00:00
Karol Blaszczak	8ed0932172	Update link to OpenVINO backend in torch.compiler.rst (#121303 ) This is a permalink, so it will remain active regardless of documentation version changes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121303 Approved by: https://github.com/soulitzer	2024-03-08 08:17:13 +00:00
Chien-Chin Huang	2e789ad522	[DCP][state_dict][doc] Update the distributed state_dict document (#121290 ) As title Pull Request resolved: https://github.com/pytorch/pytorch/pull/121290 Approved by: https://github.com/LucasLLC ghstack dependencies: #121273, #121276	2024-03-08 07:58:18 +00:00
Lucas Pasqualin	96ed37ac13	[DCP] Makes async_save public (#121325 ) Makes async_save public Differential Revision: [D54593610](https://our.internmc.facebook.com/intern/diff/D54593610/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121325 Approved by: https://github.com/wz337 ghstack dependencies: #121317	2024-03-08 05:13:13 +00:00
Cheng Ni	9bff1599b6	[Torch Elastic][Draft] Refactor SubprocessHandler to separate module for easier subclass (#120373 ) Summary: ## No Functional Change - Refactor Subprocess Handler into a separate folder for easier subclassing - SubprocessHandler - added `local_rank_id` in `SubprocessHandler` to make it available as a field in the class - pass in `local_rank_id` from subprocess start Test Plan: No functional changes. Differential Revision: D54038627 #suppress-api-compatibility-check Pull Request resolved: https://github.com/pytorch/pytorch/pull/120373 Approved by: https://github.com/kurman	2024-03-08 01:37:34 +00:00
Dheeraj Peri	b1657beac1	feat: Add min, max ranges to mark_dynamic API (#119737 ) Fixes https://github.com/pytorch/pytorch/issues/115137 This PR adds: - mark_dynamic API will accept `min`, `max` values to create a bounded constraint on the dim. - test case in test_misc.py which checks if `ConstraintViolationError` is triggered if `torch.compile` gets a input dimension out of bounds. Co-authored-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/119737 Approved by: https://github.com/ezyang, https://github.com/jansel	2024-03-07 23:26:03 +00:00
suo	c3c15eb9a6	[export] update docs to not export raw functions (#121272 ) as title Differential Revision: [D54555101](https://our.internmc.facebook.com/intern/diff/D54555101/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/121272 Approved by: https://github.com/zhxchen17	2024-03-07 17:18:07 +00:00
Wanchao Liang	1a28ebffb3	[TP] Introduce Sequence Parallel Style for Laynorm/RMSNorm/Dropout (#121295 ) As titled, this PR introduces a dedicated `ParallelStyle` to shard the nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using a manual distribute_module calls before when sharding the RMSNorm layer, but I think we should have a dedicate TP API to easily shard those layers, instead of user manually using DTensors. I call this SequenceParallel, which might bring some confusion that we technically "deprecated" a SequenceParallel style months ago. But this time the SeuqenceParallel style is significantly different with the previous ones (which used to shard two consecutive Linear layers). I believe making it the right name is the first priority, instead of worrying about the issue of reusing the old name Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295 Approved by: https://github.com/awgu, https://github.com/tianyu-l ghstack dependencies: #121294	2024-03-07 02:04:59 +00:00
Zhengxu Chen	8aeb247a3d	[export] Remove WrapperModule. (#121042 ) Summary: WrapperModule seems a good idea but may introduce some surprising behavior to users, for example, it never registers enclosed modules as submodules and therefore it's unclear that's the state dict for the exported program should look like, because some people may argue to include every state in state dict but others want to keep them as constants. Test Plan: CI Reviewed By: tugsbayasgalan Differential Revision: D54326331 Pull Request resolved: https://github.com/pytorch/pytorch/pull/121042 Approved by: https://github.com/angelayi	2024-03-05 18:10:22 +00:00
Tianyu Liu	af5376c444	[dtensor] add support for loss parallel (#119877 ) Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is via a context manager `loss_parallel`, after enabling which users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code. Here are the underlying rationales why we are going through these op replacements: 1. `nn.functional.cross_entropy` is the common method that OSS user is using for things like transformer training, to avoid changing user code, we want user to still use this function for loss calculation if they are already using it. 2. `nn.functional.cross_entropy` boils down into `aten.log_softmax` and `aten.nll_loss_foward/backward`, and DTensor now supports those ops already (#117723 #119255 #118917 #119256). They are doing computation with input replicated on the class dimension. 3. However when the input of this loss calculation is sharded on the class dimension, to run sharded computation efficiently, we need to run both `aten.log_softmax` and `aten.nll_loss_foward` with multiple all-reduce collectives in the middle of those aten ops. This is not possible if we are just overriding these two ops, so we need to have some way to decompose these two ops into smaller ops to have collectives run in the middle of these two ops. 4. We explored the existing decompositions (#118950). It seems working, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in a inefficient way, which would trigger an additional expensive collective. Recently some user also reported similar issues https://github.com/pytorch/pytorch/issues/119261. 5. Therefore, currently we are doing our own decomposition inside a context manager for sequence parallelism specifically. Once we have a better decomposition in core, we can possibly take that instead of reinventing the wheels here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877 Approved by: https://github.com/wanchaol	2024-03-02 05:06:26 +00:00
Lucas Pasqualin	9d5dea7812	[DCP] Adds storage reader and planner classes for online loading/sharding of models in torch.save format (#119816 ) as title Differential Revision: [D53718041](https://our.internmc.facebook.com/intern/diff/D53718041/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119816 Approved by: https://github.com/fegin	2024-03-01 00:21:05 +00:00
Kurman Karabukaev	67d3e4f2a2	[TorchElastic] Refactoring to support non-default logging strategy (#120691 ) Summary: Pulling out logging parameters into a logging specs that can be overridden (follow-up changes on possible mechanism) Why? Right now the logging approach is quite rigid: - Requires for log directory to exist and not be empty - Will create tempdir otherwise, - Creates subdir for a run - creates subdir for each attempt - creates files named as stdout.log, stderr.log, error.json In some instances some of the users would like to customize the behavior including file names based on context. And we do have right now a mechanism to template multiplexed teed output prefix. With current changes, users can create custom log spec that can use env variables to change the behavior. Notes: Made `LaunchConf.logs_specs` as an optional field that will be bound to `DefaultLogsSpecs` instance. There are large number of clients (code) that use the API directly without using torchrun API. For those cases, we have to explicitly pass LogSpecs implementation if we would like to override the implementation. For the regular torchrun users, we can use pluggable approach proposed in the follow up change. Test Plan: CI + unit tests Differential Revision: D54176265 Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691 Approved by: https://github.com/ezyang	2024-02-29 20:59:17 +00:00
Oleg Khabinov	4b18ab869f	[torch.export] Support is_compiling() flag for non-strict mode (#119602 ) Summary: In non-strict mode of torch.export() we didn't set those `is_compiling()` to `True` which is needed by some models. Test Plan: Unit tests and manual testing. Differential Revision: D53624452 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602 Approved by: https://github.com/suo	2024-02-29 05:52:51 +00:00
Yu, Guangye	12995a5d9d	[2/2] Intel GPU Runtime Upstreaming for Generator (#118613 ) # Motivation According to [[1/2] Intel GPU Runtime Upstreaming for Generator](https://github.com/pytorch/pytorch/pull/118528), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`. # Design Currently, it primarily offers geneartor-related APIs, including - `torch.xpu.default_generators` - `torch.xpu.get_rng_state` - `torch.xpu.get_rng_state_all` - `torch.xpu.initial_seed` - `torch.xpu.manual_seed` - `torch.xpu.manual_seed_all` - `torch.xpu.seed` - `torch.xpu.seed_all` - `torch.xpu.set_rng_state` - `torch.xpu.set_rng_state_all` # Additional Context The differences with CUDA: The generator-related frontend python APIs are 1:1 mapping with CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/118613 Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/jgong5, https://github.com/albanD	2024-02-28 05:28:11 +00:00
Lucas Pasqualin	1c1028ac49	[DCP] Adds utility for converting torch save to dcp (#119815 ) as title Differential Revision: [D53718040](https://our.internmc.facebook.com/intern/diff/D53718040/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119815 Approved by: https://github.com/fegin ghstack dependencies: #119813, #119814	2024-02-22 17:22:11 +00:00
Lucas Pasqualin	1ab441a7dd	[DCP] Adds utility for converting dcp to torch save format (#119814 ) as title Differential Revision: [D53718042](https://our.internmc.facebook.com/intern/diff/D53718042/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119814 Approved by: https://github.com/fegin ghstack dependencies: #119813	2024-02-22 16:55:58 +00:00
Hugues de Saxcé	8464654ae4	Add missing words to torch.utils.checkpoint doc (#120196 ) This PR adds a couple of missing words in the Checkpointing documentation, it doesn't have a specific issue number related to it. Changes are: - "backward." -> "backward propagation." - "to be advanced than" -> "to be more advanced than" Pull Request resolved: https://github.com/pytorch/pytorch/pull/120196 Approved by: https://github.com/soulitzer	2024-02-20 20:18:42 +00:00
andrewor14	6ea4480818	[quant][pt2e] Add `model_is_exported` util function (#119726 ) Summary: This commit adds the `model_is_exported` util function for users to be able to easily tell what APIs to call to move their models between train and eval modes. This has the additional advantage of hiding the implementation of how we detect a model is exported, in case the metadata format changes in the future. Test Plan: python test/test_quantization.py TestQuantizePT2E.test_model_is_exported Differential Revision: [D53812972](https://our.internmc.facebook.com/intern/diff/D53812972) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119726 Approved by: https://github.com/tugsbayasgalan, https://github.com/albanD	2024-02-16 19:29:36 +00:00
soulitzer	312ce35c1f	Rename singleton int to nested int (#119661 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119661 Approved by: https://github.com/ezyang	2024-02-16 19:21:17 +00:00
Yu, Guangye	8f9f12c068	Intel GPU Runtime Upstreaming for Device Allocator (#118091 ) # Motivation According to [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842) and [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), we will upstream the key functionality of device `Allocator` dedicated for XPU to PyTorch. And following our design prepare to generalize `Allocator` in parallel. # Design In the current design, XPU uses an `XPUAllocator` class, inherited from `c10::Allocator`. `XPUAllocator` is a manager to handle `DeviceCachingAllocator`, which is a per-device implementation of the caching mechanism to manage the already cached or newly allocated memory. The caching mechanism is similar to other backends, like CUDA. We can visualize the design as below. <p align="center"> <img width="162" alt="image" src="https://github.com/pytorch/pytorch/assets/106960996/6b17b8cf-e7d1-48b4-b684-f830c409d218"> </p> # Additional Context We're going to implement our design gradually. This PR covers the device `Allocator` dedicated to XPU. The second PR covers the host `Allocator`. Besides these PRs, we plan to generalize the device `Allocator` device-agnostic through another PR. In this PR, our device `Allocator` has the same memory management mechanism as CUDA, but lacks features such as expendable segments and statistics. We will add these features back in the subsequent PR which intend to generalize `Allocator`. The differences with CUDA: only key functionality, and lack of AsyncAllocator, gpu_trace, history_record, graph functionality, memory snapshot, memory statistics, expandable segment... Pull Request resolved: https://github.com/pytorch/pytorch/pull/118091 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/albanD ghstack dependencies: #117611, #117619, #117734	2024-02-16 06:46:00 +00:00
Yu, Guangye	4dc75f9084	Intel GPU Runtime Upstreaming for Event (#117734 ) # Motivation As mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the next runtime component we would like to upstream is `Event` which handles the status of an operation that is being executed. Typically, in some circumstances, we can fine-grain control of the operation execution via `Event`. # Design `XPUEvent` is a movable but not a copyable wrapper around sycl event. It should be created lazily on an XPU device when recording an `XPUStream`. Meanwhile, `XPUEvent` can wait for another `XPUEvent` or all the submitted kernels on an `XPUStream` to complete. Align to the other backend, the C++ files related to `Event` will be placed in `aten/src/ATen/xpu` folder. For frontend code, `XPUEvent` runtime API will be bound to Python `torch.xpu.Event`. The corresponding C++ code will be placed in `torch/csrc/xpu/Event.cpp` and Python code will be placed in `torch/xpu/streams.py` respectively. # Additional Context It is worth mentioning that the `elapsed_time` method is temporarily not supported by `XPUEvent`. We will be adding support for it soon. Meanwhile `XPUEvent` doesn't support IPC from different processes. For the other parts, we have almost a 1:1 mapping with CUDA. lack of the below APIs: - `torch.cuda.Event.ipc_handle` - `CUDAEvent`'s constructor with `IpcEventHandle` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117734 Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/jgong5, https://github.com/malfet ghstack dependencies: #117611, #117619	2024-02-16 06:28:26 +00:00
lancerts	444c628e06	Include the scalar tensor auto-transfer in the doc (#119967 ) Fixes #119609 @albanD Co-authored-by: albanD <desmaison.alban@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/119967 Approved by: https://github.com/albanD	2024-02-15 22:37:39 +00:00
drisspg	744898b311	Add doc page for environment variables that effect PyTorch Runtime (#119087 ) # Summary The goal of this PR is to add a doc page to list a number of environment that effect the PyTorch runtime. It will likely not be exhaustive but hopefully will be added and updated to stay relevant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119087 Approved by: https://github.com/janeyx99, https://github.com/eqy	2024-02-15 21:41:38 +00:00
Kazuaki Ishizaki	a2f07bb317	Fix typo under docs directory (#119657 ) This PR fixes typo under `docs` directory. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119657 Approved by: https://github.com/colesbury	2024-02-15 21:14:34 +00:00
Eddie Yan	cd380c794f	[CUDNN][SDPA] Experimental cuDNN Flash Attention v2 Inference (#115663 ) #113713 Going to clean up some of the checks and will remove draft status after. Can be tested on SM80+ with `TORCH_CUDNN_MHA_ENABLED=1`. CC @drisspg @ptrblck Pull Request resolved: https://github.com/pytorch/pytorch/pull/115663 Approved by: https://github.com/drisspg	2024-02-14 22:02:06 +00:00
andrewor14	8ec8d78ef2	[quant][pt2e][be] Rename eval_utils -> export_utils (#119725 ) It's not really eval_utils anymore, since we added some training related utils. Instead it should be util functions that are related to general export use cases. Differential Revision: [D53711494](https://our.internmc.facebook.com/intern/diff/D53711494) Pull Request resolved: https://github.com/pytorch/pytorch/pull/119725 Approved by: https://github.com/tugsbayasgalan	2024-02-13 19:10:06 +00:00
Yu, Guangye	8fd11cb307	[2/2] Intel GPU Runtime Upstreaming for Stream (#117619 ) # Motivation According to [[1/2] Intel GPU Runtime Upstreaming for Stream](https://github.com/pytorch/pytorch/pull/117611), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), the second PR covers the changes under `python frontend`. # Design Currently, it primarily offers stream-related APIs, including - `torch.xpu.StreamContext` - `torch.xpu.current_stream` - `torch.xpu.set_stream` - `torch.xpu.synchronize` - `torch._C._xpu_getCurrentRawStream` # Additional Context We will implement functions like `torch.xpu.Stream.wait_event`, `torch.xpu.Stream.wait_stream`, and `torch.xpu.Stream.record_event` in the next PR related with `Event`. The differences with CUDA: no default and external stream in XPU and lack of below APIs: - `torch.cuda.ExternalStream` - `torch.cuda.default_stream` - `toch.cuda.is_current_stream_capturing` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117619 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD ghstack dependencies: #117611	2024-02-10 03:39:42 +00:00
Mikayla Gawarecki	3372aa51b4	Integrate swap_tensors into nn.Module.load_state_dict (#117913 ) Added a `torch.Tensor` method that defines how to transform `other`, a value in the state dictionary, to be loaded into `self`, a param/buffer in an `nn.Module` before swapping via `torch.utils.swap_tensors` * `param.module_load(sd[key])` This method can be overridden using `__torch_function__`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/117913 Approved by: https://github.com/albanD	2024-02-09 22:32:29 +00:00
Angela Yi	0827510fd3	[export] Remove torch._export.export (#119095 ) XLA changes: https://github.com/pytorch/xla/pull/6486 Test Plan: CI Differential Revision: D53316196 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119095 Approved by: https://github.com/ydwu4, https://github.com/zhxchen17, https://github.com/tugsbayasgalan, https://github.com/avikchaudhuri, https://github.com/jerryzh168	2024-02-08 21:22:04 +00:00
Yu, Guangye	9a992b0918	[4/4] Intel GPU Runtime Upstreaming for Device (#116869 ) # Motivation According to [[1/4] Intel GPU Runtime Upstreaming for Device](https://github.com/pytorch/pytorch/pull/116019), as mentioned in [[RFC] Intel GPU Runtime Upstreaming](https://github.com/pytorch/pytorch/issues/114842), this last PR covers the changes under lazy initialization. # Design This PR primarily offers the support of multi-processing via lazy initialization. We lazily initialize our runtime avoiding initializing XPU until the first time it is accessed. In our design, we extend `cuda_lazy_init` to `device_lazy_init` which is a device-agnostic API that can support any backend. And change `maybe_initialize_cuda` to `maybe_initialize_device` to support lazy initialization for both CUDA and XPU while maintaining scalability. # Additional Context We adopt a similar design to CUDA. So we share some code with CUDA. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116869 Approved by: https://github.com/EikanWang, https://github.com/jgong5, https://github.com/gujinghui, https://github.com/malfet ghstack dependencies: #119248	2024-02-08 03:01:21 +00:00
Mateus Devino	64aaa8f508	Fix typo on Contribution Guide (#119428 ) Fixes #119427 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119428 Approved by: https://github.com/awgu, https://github.com/kit1980	2024-02-08 01:07:27 +00:00
Mikayla Gawarecki	d5a718d27b	Add swap_tensors path to nn.Module._apply (#117167 ) Added `torch.__future__.{get/set}_swap_module_params_on_conversion` that defaults to `False` for now, but we probably want to modify to override this and default to `True` in `nn.Module._apply` if input is a tensor subclass. From offline discussion, for now we are not allowing `swap_tensor` after the first module forward has been run* if the autograd graph is still alive. The reason being that `torch.utils.swap_tensors(t1, t2)` requires the `use_count` of both `TensorImpl`s associated with `t1` and `t2` to be 1. The first forward pass will install `AccumulateGrad` nodes on each param, which [bump the refcount of the associated TensorImpl](`6cf1fc66e3/torch/csrc/autograd/variable.cpp (L307)`). Future work might be to swap the refs that the `AccumulateGrad` nodes hold if it is necessary. From this, it might seem like we don't need to handle gradients. However, I still handle the grads for the edge case that the grads are set via `p.grad = grad` OR the autograd graph is no longer alive because the output has been garbage collected. If any `swap_tensors` fails on any of the parameters in the `nn.Module` we raise an error. `RNNBase` overrides `nn.Module._apply()` and installs weakrefs on some parameters. As a result, all modules that inherit from `RNNBase` (`RNN`, `GRU` and `LSTM`) cannot use the`swap_tensors` path as of now* Pull Request resolved: https://github.com/pytorch/pytorch/pull/117167 Approved by: https://github.com/albanD ghstack dependencies: #118028	2024-02-07 18:55:44 +00:00

1 2 3 4 5 ...

2446 Commits