Summary:
Builds a hook for an external mechanism to monitor the health of the torch elastic launcher. The health check server takes a dependency on FileTimerServer to determine whether the launcher is healthy. It will always report healthy if FileTimerServer is disabled.
The implementation of start_healthcheck_server is left unsupported; however, a TCP/HTTP server can be started on a specific port to monitor the liveness of the worker_watchdog and take action accordingly.
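For illustration, here is a minimal sketch of such a TCP/HTTP health endpoint; the `start_healthcheck_server` body shown here and the `watchdog_alive` callback are hypothetical, not the API added in this PR.
```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

def start_healthcheck_server(watchdog_alive, port=8080):
    """Serve GET requests that report worker_watchdog liveness (hypothetical sketch)."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            healthy = watchdog_alive()  # e.g. a FileTimerServer / worker_watchdog liveness check
            self.send_response(200 if healthy else 503)
            self.end_headers()
            self.wfile.write(b"OK" if healthy else b"UNHEALTHY")

    server = HTTPServer(("0.0.0.0", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```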
Test Plan: buck test mode/opt caffe2/test/distributed/elastic/agent/server/test:local_agent_test
Differential Revision: D55108182
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122750
Approved by: https://github.com/kurman
Fixes https://github.com/pytorch/pytorch/issues/123068
Fixes https://github.com/pytorch/pytorch/issues/111256
While investigating the flaky doc build failure w.r.t. the duplicated `torch.ao.quantization.quantize` docstring warning, i.e. https://github.com/pytorch/pytorch/actions/runs/8532187126/job/23376591356#step:10:1260, I discovered an old but still open bug in Sphinx: https://github.com/sphinx-doc/sphinx/issues/4459. These warnings have always been there, but they were hidden because we are using `-j auto` to build docs with multiple threads. It's just by chance that they have started to surface now.
The issue can be reproduced by removing `-j auto` from https://github.com/pytorch/pytorch/blob/main/docs/Makefile#L5 and running `make html` locally. Then these warnings show up consistently. As `make html` treats warnings as errors, they will fail the build.
```
...
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/ao/quantization/quantize.py:docstring of torch.ao.quantization.quantize.quantize:1: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in quantization, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py:docstring of torch.nn.parallel.data_parallel.data_parallel:1: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in nn, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/spectral_norm.py:docstring of torch.nn.utils.spectral_norm.spectral_norm:1: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in nn, use :noindex: for one of them
/data/users/huydo/conda/py3.8/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:docstring of torch.nn.utils.weight_norm.weight_norm:1: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in nn, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:579: WARNING: duplicate object description of torch.nn.parallel.data_parallel, other instance in generated/torch.nn.functional.torch.nn.parallel.data_parallel, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:594: WARNING: duplicate object description of torch.nn.utils.spectral_norm, other instance in generated/torch.nn.utils.spectral_norm, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/nn.rst:595: WARNING: duplicate object description of torch.nn.utils.weight_norm, other instance in generated/torch.nn.utils.weight_norm, use :noindex: for one of them
/data/users/huydo/github/pytorch/docs/source/quantization.rst:1348: WARNING: duplicate object description of torch.ao.quantization.quantize, other instance in generated/torch.ao.quantization.quantize, use :noindex: for one of them
...
```
The fix is just to clean up those duplicated placeholder py:module docs, which were there because these modules didn't have any docs originally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123244
Approved by: https://github.com/andrewor14, https://github.com/malfet
This is the entrypoint for defining an opaque/blackbox (i.e. PyTorch will
never peek into it) custom op. In this PR, you can specify backend impls
and the abstract impl for this op.
NB: most of this PR is docstrings, please don't be intimidated by the
line count.
There are a number of interesting features:
- we infer the schema from type hints. In a followup I add the ability
to manually specify a schema.
- name inference. The user needs to manually specify an op name for now.
In a followup we add the ability to automatically infer a name (this
is a little tricky).
- custom_op registrations can override each other. This makes them
more pleasant to work with in environments like colab.
- we require that the outputs of the custom_op do not alias any inputs
or each other. We enforce this via a runtime check, but can relax this
into an opcheck test if it really matters in the future.
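Putting the features above together, here is a hedged usage sketch; it assumes the entrypoint is surfaced as `torch.library.custom_op` with a `register_fake` method for the abstract impl, as in later releases; the exact names in this PR may differ.
```python
import numpy as np
import torch
from torch import Tensor

# Schema is inferred from the type hints; the op name is specified manually.
@torch.library.custom_op("mylib::numpy_sin", mutates_args=())
def numpy_sin(x: Tensor) -> Tensor:
    # Backend impl: opaque to PyTorch; returns a fresh tensor, so no aliasing.
    return torch.from_numpy(np.sin(x.numpy()))

# Abstract impl used under tracing/compile: only output metadata matters.
@numpy_sin.register_fake
def _(x: Tensor) -> Tensor:
    return torch.empty_like(x)

print(numpy_sin(torch.randn(3)))
```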
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122344
Approved by: https://github.com/ezyang, https://github.com/albanD
This supersedes the previous "Guards Overview" as a more comprehensive
approach to most of the main topics within Dynamo.
In the future, we could add specific sections for each of the topics
discussed here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122305
Approved by: https://github.com/msaroufim
`dynamo.explain()` was updated to return a structure but the docs weren't updated to match.
- Update the docs to use the new API
- Remove some dead code left when `explain` was updated.
- Drive-by: Fix some `nopython` uses that I noticed
- Drive-by: I noticed an ignored error coming from CleanupHook on shutdown - make it check the global before setting it.
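For reference, a small sketch of the structured API the docs now describe; attribute names such as `graph_count` and `break_reasons` follow the current `ExplainOutput` and may vary by version.
```python
import torch
import torch._dynamo

def fn(x):
    return x.sin() + x.cos()

# explain() now returns a structure instead of printing a report string.
explanation = torch._dynamo.explain(fn)(torch.randn(4))
print(explanation.graph_count, explanation.graph_break_count)
print(explanation.break_reasons)
```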
Fixes #122573
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122745
Approved by: https://github.com/jansel
Previously, when we applied a replacement, a SymInt that had been an
unbacked SymInt would transmute into whatever we replaced it with
(e.g., a constant).
This has a major downside: we often look at the SymInts associated with
FX nodes (e.g., the meta of an x.item() call's return) to find out where the
unbacked SymInt was allocated. If we replace it, we can no longer find
out where, e.g., u1 was allocated! But we need to know this
so we can generate deferred runtime asserts like u1 == s0.
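For context, here is a hedged illustration (not code from this PR) of the kind of program where this matters: the `item()` call allocates an unbacked SymInt, and `torch._check` records the equality that later becomes a deferred runtime assert.
```python
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(dynamic=True, fullgraph=True)
def f(x, y):
    u = x.item()                   # allocates an unbacked SymInt, e.g. u0
    torch._check(u == y.shape[0])  # becomes a deferred runtime assert like u0 == s0
    return torch.zeros(u) + y      # uses u0 as a size

f(torch.tensor(3), torch.randn(3))
```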
To solve this problem, I add a special mode to `replace`, `resolve_unbacked=False`, which lets you disable substitutions on unbacked SymInts. When reporting `node.expr`, we preferentially avoid applying unbacked SymInt substitutions. To understand whether we might accidentally reapply the substitution later, before we have reached the deferred runtime assert, we must study the calls to `simplify()` in ShapeEnv. My audit turns up these sites:
* `produce_guards`: this is fine, deferred runtime asserts never show up here, we must NOT have unbacked SymInts show up here. Similarly `get_nontrivial_guards`.
* `_maybe_evaluate_static`: this is fine, we are using this to determine if it is necessary to produce a guard/runtime assert. We don't want to reissue a runtime assert if we've already asserted on it, and replacements can help us understand if this has occurred.
* `_simplify_floor_div`: this is a legitimate bug, it needs to be `resolve_unbacked=False`
* `_refine_ranges`: this is fine, a refined range doesn't affect what runtime asserts we issue
* `_update_divisible`: this updates the `self.divisible` set, which specifies when we can simplify away divisibility constraints. Since this affects replacements only, it won't cause us to oversimplify a user provided expression.
There are some situations where we DO want to always apply the substitution, specifically when we have the duplicate symbol problem (we retrace an item call and get u0 and u1, which refer to the same thing). I don't want two symbols in this case, so a special `rename_unbacked_to` is provided which sets up the unconditional renaming.
Along the way, I make a refinement to `_update_var_to_range`: if you update a var range for a size-like unbacked SymInt, you are now no longer allowed to set its lower bound below 2. This is because if you could, then our size oblivious tests for it would be inconsistent. Actually, I think there is still some inconsistency, because if you assert `u0 == 0` we will still end up with this in deferred runtime asserts, and we will then use this to simplify these statements to be True everywhere else. Maybe we should forbid this kind of refinement; not done in this PR.
Fixes https://github.com/pytorch/pytorch/issues/119689
Fixes https://github.com/pytorch/pytorch/issues/118385
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120816
Approved by: https://github.com/lezcano
See #113541
The PR allows for registering and controlling multiple RNG states using indices, ensuring cudagraph-safe operations, and includes both C++ and Python API changes to support this functionality.
cc @eellison @anijain2305 @jansel @ezyang @ptrblck @csarofeen @mcarilli
Pull Request resolved: https://github.com/pytorch/pytorch/pull/114068
Approved by: https://github.com/ezyang
This PR corrects the AOTInductor example, which currently fails with:
```
/home/ubuntu/test/inference.cpp:21:62: error: cannot bind non-const lvalue reference of type ‘std::vector<at::Tensor>&’ to an rvalue of type ‘std::vector<at::Tensor>’
21 | std::cout << runner.run({torch::randn({2, 10}, at::kCPU)})[0] << std::endl;
|
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121672
Approved by: https://github.com/desertfire
Summary:
## No Functional Change
- Refactor Subprocess Handler into a separate folder for easier subclassing
- SubprocessHandler
  - added `local_rank_id` in `SubprocessHandler` to make it available as a field in the class
  - pass in `local_rank_id` from subprocess start
Test Plan: No functional changes.
Differential Revision: D54038627
#suppress-api-compatibility-check
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120373
Approved by: https://github.com/kurman
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. We were mainly using manual
distribute_module calls before when sharding the RMSNorm layer, but I
think we should have a dedicated TP API to easily shard those layers,
instead of users manually using DTensors.
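For reference, a hedged usage sketch of the new style via `parallelize_module`; it assumes a torchrun launch with a CUDA backend, and the module names in the plan ("norm", "dropout") are illustrative.
```python
import os
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel

class Block(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.dropout = nn.Dropout(0.1)

    def forward(self, x):
        return self.dropout(self.norm(x))

# Run under torchrun so WORLD_SIZE and the default process group are set up.
tp_mesh = init_device_mesh("cuda", (int(os.environ["WORLD_SIZE"]),))
block = parallelize_module(
    Block(),
    tp_mesh,
    {"norm": SequenceParallel(), "dropout": SequenceParallel()},
)
```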
I call this SequenceParallel, which might cause some confusion since we
technically "deprecated" a SequenceParallel style months ago. But this
time the SequenceParallel style is significantly different from the
previous one (which used to shard two consecutive Linear layers). I
believe getting the name right is the first priority, instead of
worrying about the issue of reusing the old name.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
Summary: WrapperModule seems like a good idea but may introduce some surprising behavior to users. For example, it never registers enclosed modules as submodules, and therefore it's unclear what the state dict for the exported program should look like: some people may argue for including every state in the state dict, while others want to keep them as constants.
Test Plan: CI
Reviewed By: tugsbayasgalan
Differential Revision: D54326331
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121042
Approved by: https://github.com/angelayi
Loss parallel is the last piece of sequence parallelism to enable. It enables efficient distributed cross entropy computation when the input is sharded on the class dimension (in a classification problem with many classes). The implementation is via a context manager `loss_parallel`, after enabling which users can directly use `torch.nn.functional.cross_entropy` or `torch.nn.CrossEntropyLoss` without modifying other parts of their code.
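For reference, a minimal hedged sketch of the intended usage; `pred` is assumed to be a DTensor sharded on the class (last) dimension across the TP mesh, and `labels` the corresponding label tensor.
```python
import torch.nn.functional as F
from torch.distributed.tensor.parallel import loss_parallel

def sharded_loss(pred, labels):
    # pred: DTensor logits sharded on the class (last) dimension
    # labels: label tensor for the local batch
    with loss_parallel():
        loss = F.cross_entropy(pred, labels)
        loss.backward()
    return loss
```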
Here are the underlying rationales for why we are going through these op replacements:
1. `nn.functional.cross_entropy` is the common method that OSS users use for things like transformer training. To avoid changing user code, we want users to keep using this function for loss calculation if they are already doing so.
2. `nn.functional.cross_entropy` boils down to `aten.log_softmax` and `aten.nll_loss_forward/backward`, and DTensor already supports those ops (#117723, #119255, #118917, #119256). They perform the computation with the input *replicated* on the class dimension.
3. However, when the input of this loss calculation is **sharded on the class dimension**, to run the sharded computation efficiently we need to run both `aten.log_softmax` and `aten.nll_loss_forward` with multiple all-reduce collectives **in the middle of** those aten ops. This is not possible if we just override these two ops, so we need some way to **decompose** them into smaller ops so that collectives can run in between.
4. We explored the existing decompositions (#118950). They seem to work, except that `log_softmax_backward` and `nll_loss_backward` combined together in aten are implemented in an inefficient way that triggers an additional expensive collective. Recently, users also reported similar issues: https://github.com/pytorch/pytorch/issues/119261.
5. Therefore, for now we are doing our own decomposition inside a context manager, specifically for sequence parallelism. Once we have a better decomposition in core, we can possibly adopt that instead of reinventing the wheel here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119877
Approved by: https://github.com/wanchaol
Summary:
Pulling out logging parameters into logging specs that can be overridden (a possible override mechanism comes in follow-up changes)
Why?
Right now the logging approach is quite rigid:
- Requires the log directory to exist and not be empty
- Will create a tempdir otherwise
- Creates a subdir for the run
- Creates a subdir for each attempt
- Creates files named stdout.log, stderr.log, error.json
In some instances, users would like to customize this behavior, including the file names, based on context. And we do already have a mechanism to template the multiplexed teed output prefix.
With the current changes, users can create a custom logs spec that uses env variables to change the behavior.
Notes:
Made `LaunchConf.logs_specs` an optional field that will be bound to a `DefaultLogsSpecs` instance. There are a large number of clients (code) that use the API directly without using the torchrun API. For those cases, we have to explicitly pass a LogsSpecs implementation if we would like to override it. For regular torchrun users, we can use the pluggable approach proposed in the follow-up change.
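For reference, a hedged sketch of passing a custom-configured `DefaultLogsSpecs` through the direct launcher API; the constructor and `LaunchConfig` field names follow the description above and may differ slightly by version, and `trainer` is a hypothetical entrypoint.
```python
from torch.distributed.elastic.multiprocessing import DefaultLogsSpecs, Std
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def trainer():  # hypothetical worker entrypoint
    print("hello from worker")

# Customize where and how worker stdout/stderr are written.
logs_specs = DefaultLogsSpecs(log_dir="/tmp/my_run_logs", redirects=Std.ALL, tee=Std.ALL)
config = LaunchConfig(
    min_nodes=1,
    max_nodes=1,
    nproc_per_node=2,
    rdzv_backend="c10d",
    rdzv_endpoint="localhost:29500",
    logs_specs=logs_specs,  # optional; defaults to a DefaultLogsSpecs instance
)
elastic_launch(config, trainer)()
```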
Test Plan: CI + unit tests
Differential Revision: D54176265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120691
Approved by: https://github.com/ezyang
Summary: In non-strict mode of torch.export(), we didn't set those `is_compiling()` flags to `True`, which is needed by some models.
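For illustration, a hedged sketch of the kind of model this affects; the exact `is_compiling()` helpers covered are listed in the PR, and `torch.compiler.is_compiling()` is used here as one example.
```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        # Model logic that branches on the compilation flag; with this change,
        # non-strict export also takes the is_compiling() branch.
        if torch.compiler.is_compiling():
            return x + 1
        return x - 1

ep = torch.export.export(M(), (torch.randn(2),), strict=False)
```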
Test Plan: Unit tests and manual testing.
Differential Revision: D53624452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119602
Approved by: https://github.com/suo
This PR adds a couple of missing words to the Checkpointing documentation; it doesn't have a specific issue number related to it.
Changes are:
- "backward." -> "backward propagation."
- "to be advanced than" -> "to be more advanced than"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120196
Approved by: https://github.com/soulitzer
Summary: This commit adds the `model_is_exported` util function
for users to be able to easily tell what APIs to call to move
their models between train and eval modes. This has the
additional advantage of hiding the implementation of how we
detect a model is exported, in case the metadata format changes
in the future.
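For illustration, a hedged sketch of the intended usage; the import path for `model_is_exported` is assumed here and may differ by version.
```python
import torch
from torch.ao.quantization.pt2e.export_utils import model_is_exported

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu()

eager = M()
exported = torch.export.export(eager, (torch.randn(2),)).module()

print(model_is_exported(exported))  # True
print(model_is_exported(eager))     # False

# Pick the right train/eval toggle based on the check.
if model_is_exported(exported):
    torch.ao.quantization.move_exported_model_to_eval(exported)
else:
    exported.eval()
```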
Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_model_is_exported
Differential Revision: [D53812972](https://our.internmc.facebook.com/intern/diff/D53812972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119726
Approved by: https://github.com/tugsbayasgalan, https://github.com/albanD