pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Yu, Guangye	19a83eacb5	add new API torch.amp.is_autocast_available (#124938 ) # Motivation expose `torch._is_autocast_available` to `torch.amp.is_autocast_available` as a public api. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124938 Approved by: https://github.com/albanD	2024-04-26 08:45:20 +00:00
PyTorch MergeBot	f131c2c199	Revert "Fix mypy issues in fake_tensor.py (#124428 )" This reverts commit `25c0d3f3f0`. Reverted https://github.com/pytorch/pytorch/pull/124428 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))	2024-04-26 06:15:17 +00:00
PyTorch MergeBot	1ac60484c1	Revert "Fix global flake8 issues (#124771 )" This reverts commit `f01275934b`. Reverted https://github.com/pytorch/pytorch/pull/124771 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))	2024-04-26 06:15:17 +00:00
Yu, Guangye	cdc66e9dc3	refactor autocast python APIs (#124479 ) # Motivation Refactor autocast usage scenario in `torch/amp/autocast_mode.py` and `torch/utils/checkpoint.py` to fix the bug - convention conflict between `torch.xxx.get_autocast_xxx_dtype` defined in `autocast_mode.py` and `torch.xxx.get_autocast_dtype` defined in `checkpoint.py`. # Solution Use device-agnostic APIs like `torch.get_autocast_dtype`, ..., instead. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124479 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD ghstack dependencies: #124359	2024-04-25 14:33:33 +00:00
Aaron Orenstein	f01275934b	Fix global flake8 issues (#124771 ) Prior to this `lintrunner --all-files --take FLAKE8` failed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771 Approved by: https://github.com/Skylion007 ghstack dependencies: #124428	2024-04-25 14:25:00 +00:00
Aaron Orenstein	25c0d3f3f0	Fix mypy issues in fake_tensor.py (#124428 ) fake_tensor.py had mypy error ignored. That seems less than desirable. Also added SafePyObjectT<T> which is a tagged wrapper around a SafePyObject but provides static type checking (with no other guarantees). Used `SafePyObjectT<TorchDispatchModeKey>` on some of the TorchDispatchModeTLS API to ensure that we don't accidentally inject a different type than expected into the stack. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124428 Approved by: https://github.com/malfet	2024-04-25 14:07:53 +00:00
Aaron Enye Shi	80ab062103	[MemoryViz] Improve description of blocks with missing frames (#124784 ) Summary: It is common for blocks to be missing frames and there are many users asking why. Let's improve this output message to cover common reasons: 1) block was allocated before _record_memory_history was enabled 2) context or stacks passed to _record_memory_history does not include this block 3) backward events allocated with C++ stack and will not show if stacks = python Test Plan: CI and ran it locally: ![image](https://github.com/pytorch/pytorch/assets/17602366/60a03a22-0e3e-43d8-9ee7-b14358096fc7) Differential Revision: D56490921 Pulled By: aaronenyeshi Pull Request resolved: https://github.com/pytorch/pytorch/pull/124784 Approved by: https://github.com/zdevito	2024-04-24 21:16:31 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar	674e15ae07	Back out "Switch to predispatch" (#124860 ) Summary: Original commit changeset: 1f155b3a0bfc Original Phabricator Diff: D56273267 Test Plan: CI Differential Revision: D56526505 Pull Request resolved: https://github.com/pytorch/pytorch/pull/124860 Approved by: https://github.com/angelayi	2024-04-24 17:28:33 +00:00
Tugsbayasgalan Manlaibaatar	c933af2709	Switch to predispatch (#123573 ) This PR switches export IR from aot-dispatch to pre-dispatch IR. What is pre-dispatch IR and why should you care? Currently the default IR returned by torch.export can contain only functional ATen operators after ALL pytorch dispatcher decompositions (for example, CompositeImplicitAutograd) run. In contrast, pre-dispatch IR refers to an IR that can contain all functional ATen operators (i.e., not just from the core subset), before any decomposition happens, as well as operators that manipulate autograd state. Pre-dispatch IR closely resembles eager PyTorch computation, but is still functional and serializable by torch.export. As a result: - You can train the pre-dispatch IR in eager mode as the IR contains necessary information for the autograd engine to automatically generate a backward graph. - You can write sound graph transformations more easily as the IR is functional. - Since it is an ATen IR, it is still normalized. For example, torch.add has multiple overloads, but aten.add.Tensor is unique in this IR. If you want to get the core aten IR out of `torch.export`, you will need to: ``` ep = torch.export.export(M(), inputs) ep_for_core_aten = ep.run_decompositions() ``` Differential Revision: [D56273267](https://our.internmc.facebook.com/intern/diff/D56273267) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123573 Approved by: https://github.com/gmagogsfm	2024-04-24 00:51:09 +00:00
Jeff Daily	a21327e0b0	[ROCm] update hipDataType support and hipify mappings (#120751 ) The hipDataType support and mappings are now up to date as of ROCm 5.7. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120751 Approved by: https://github.com/jithunnair-amd, https://github.com/malfet	2024-04-23 23:21:56 +00:00
Yu, Guangye	25f321b84f	Refactor autocast C++ APIs to be device-agnostic (#124359 ) # Motivation This PR aims to refactor autocast C++ APIs to be device-agnostic and deprecate the device-specific autocast C++ APIs. In C++ side, - `is_enabled()` -> `is_enabled(device_type)`. - `set_enabled(new_enabled)` -> `set_enabled(device_type, new_enabled)`. - `get_autocast_dtype()` -> `get_autocast_dtype(device_type)` - `set_autocast_dtype(dtype)` -> `set_autocast_dtype(device_type, dtype)` These following C++ APIs are deprecated and should be removed in PyTorch 2.5 - `is_cpu_enabled` - `set_cpu_enabled` - `get_autocast_cpu_dtype` - `set_autocast_cpu_dtype` - `is_xpu_enabled` - `set_xpu_enabled` - `get_autocast_xpu_dtype` - `set_autocast_xpu_dtype` - `is_ipu_enabled` - `set_ipu_enabled` - `get_autocast_ipu_dtype` - `set_autocast_ipu_dtype` - `is_hpu_enabled` - `set_hpu_enabled` - `get_autocast_hpu_dtype` - `set_autocast_hpu_dtype` - `is_xla_enabled` - `set_xla_enabled` - `get_autocast_xla_dtype` - `set_autocast_xla_dtype` - `is_privateuseone_enabled` - `set_privateuseone_enabled` - `get_autocast_privateuseone_dtype` - `set_autocast_privateuseone_dtype` In Python side, provide 4 generic autocast APIs: - `torch.is_autocast_enabled(device_type)` - `torch.set_autocast_enabled(device_type, new_enabled)` - `torch.get_autocast_dtype(device_type)` - `torch.set_autocast_dtype(device_type, dtype)` # Additional Context We will submit another PR to refactor autocast Python APIs based on this PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124359 Approved by: https://github.com/jgong5, https://github.com/albanD	2024-04-23 10:38:50 +00:00
haozhe.zhu	3c964ad1ca	add fused_sgd_kernel support for CPU device (#123629 ) Add fused_sgd_kernel support for CPU. ## Bench result: 32 core/sockets ICX Test Scripts: https://gist.github.com/zhuhaozhe/688763e17e93e4c5e12f25f676ec90d9 https://gist.github.com/zhuhaozhe/ad9938694bc7fae8b66d376f4dffc6c9 ``` Tensor Size: 262144, Num Tensor 4, Num Threads: 1 _single_tensor_sgd time: 0.2301 seconds _fused_sgd time: 0.0925 seconds Tensor Size: 4194304, Num Tensor 32, Num Threads: 32 _single_tensor_sgd time: 2.6195 seconds _fused_sgd time: 1.7543 seconds ``` ## Test Plan: ``` python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_optim.py -k test_can_load_older_state_dict python test_optim.py -k test_grad_scaling_autocast_fused_optimizers python test_torch.py -k test_grad_scaling_autocast_fused python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step ``` It looks like there are already some PRs under this issue https://github.com/pytorch/pytorch/issues/123451 to unify the UTs, so I did not modify the UTs in this PR. Co-authored-by: Jane Xu <janeyx@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123629 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-04-23 08:28:19 +00:00
Nikita Shulga	de5d689cf9	[EZ] Update pillow to 10.3.0 (#124614 ) Older versions are subject to [CVE-2024-28219](https://nvd.nist.gov/vuln/detail/CVE-2024-28219), although it's not super important from a CI PoV. Modernize `torch/utils/tensorboard/summary.py` to use Pillow-9+ APIs (is this file even used for anything anymore?) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124614 Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi	2024-04-23 03:22:23 +00:00
Jeff Daily	6ede882c0b	preferred blas library; cublaslt gemm implementation (#122106 ) Following the example of PyTorch supporting a preferred Linalg library (cusolver or magma), this PR introduces a preferred blas library selector of either cublas or cublaslt for CUDA and hipblas or hipblaslt for ROCm via normal hipification of sources. The default blas implementation remains cublas or hipblas. cublaslt or hipblaslt can be enabled using environment variable TORCH_BLAS_PREFER_CUBLASLT=1 (or TORCH_BLAS_PREFER_HIPBLASLT=1 as an alias) or by calling `torch.backends.cuda.preferred_blas_library(backend="cublaslt")` or as an alias `backend="hipblaslt"`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122106 Approved by: https://github.com/lezcano	2024-04-22 15:38:22 +00:00
Aaron Gokaslan	c5fafe9f48	[BE]: TRY002 - Ban raising vanilla exceptions (#124570 ) Adds a ruff lint rule to ban raising raw exceptions. Most of these should at the very least be runtime errors, value errors, type errors or some other errors. There are hundreds of instances of these bad exception types already in the codebase, so I have noqa'd most of them. Hopefully this error code will get committers to rethink what exception type they should raise when they submit a PR. I also encourage people to gradually go and fix all the existing noqas that have been added so they can be removed over time and our exception typing can be improved. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124570 Approved by: https://github.com/ezyang	2024-04-21 22:26:40 +00:00
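The intent of the TRY002 rule in c5fafe9f48 can be shown with a small sketch. The helper `parse_port` and its messages are hypothetical and not taken from the PyTorch codebase; the sketch only shows why a specific exception type is more useful to callers than a bare `Exception`.

```python
def parse_port(value: str) -> int:
    """Parse a TCP port number (hypothetical helper, for illustration only)."""
    try:
        port = int(value)
    except ValueError as err:
        # A rule like TRY002 would flag `raise Exception(...)` here.
        # ValueError lets callers handle this failure without catching everything.
        raise ValueError(f"port must be an integer, got {value!r}") from err
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return port


print(parse_port("8080"))  # prints 8080
```

A caller can now write `except ValueError:` and let unrelated failures propagate.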
Aaron Gokaslan	29cc293725	[BE]: FURB142 - Remove set mutations. Use set update (#124551 ) Uses set mutation methods instead of manually reimplementing (update, set_difference etc). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551 Approved by: https://github.com/ezyang	2024-04-21 14:12:33 +00:00
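As a rough illustration of the pattern 29cc293725 targets, the sketch below compares element-by-element set mutation with the built-in bulk methods. The variable names and values are made up for the example.

```python
known = {"cpu"}
extra = ["cuda", "xpu", "cpu"]
stale = ["xpu"]

# Manual form: one add()/discard() call per element.
manual = set(known)
for name in extra:
    manual.add(name)
for name in stale:
    manual.discard(name)

# Bulk form: the same result using the built-in set mutation methods.
bulk = set(known)
bulk.update(extra)
bulk.difference_update(stale)

print(sorted(bulk))  # prints ['cpu', 'cuda']
```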
Aaron Gokaslan	5a1216bb2e	[BE]: Update ruff to 0.4.1 (#124549 ) Update ruff to 0.4.1. This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes. Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0 \| Repository \| Linter (v0.3) \| Linter (v0.4) \| Formatter (v0.3) \| Formatter (v0.4) \| \|----------------------------------------------------\|---------------\|---------------\|------------------\|------------------\| \| [pytorch/pytorch](https://github.com/pytorch/pytorch) \| 328.7 \| 251.8 \| 351.1 \| 274.9 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549 Approved by: https://github.com/ezyang	2024-04-21 14:06:23 +00:00
cdzhan	f8f7cfbeee	Add __torch_function__ support for generated tensor methods/property of PrivateUse1 (#121723 ) support following case: ```python import torch ... class CustomFooTensor(torch.Tensor): @classmethod def __torch_function__(cls, func, types, args=(), kwargs=None): ... a = CustomFooTensor([3]) print(a.is_foo) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/121723 Approved by: https://github.com/albanD	2024-04-19 22:34:34 +00:00
rzou	a78450a00b	Excise uses of the old custom ops APIs (#124134 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124134 Approved by: https://github.com/albanD ghstack dependencies: #124180, #124200, #124299	2024-04-19 17:56:26 +00:00
rraminen	803a08f8ae	[ROCm] Add cublasGemmAlgo_t -> hipblasGemmAlgo_t (#121030 ) This PR is to add cublasGemmAlgo_t -> hipblasGemmAlgo_t to cuda_to_hip_mappings.py. It is required for DeepSpeed transformer extension build on ROCm. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121030 Approved by: https://github.com/jeffdaily, https://github.com/ezyang	2024-04-19 02:57:16 +00:00
Xuehai Pan	93e249969b	[BE] enable `ruff` rule `RSE` and remove useless parentheses in `raise` statements (#124261 ) Remove useless parentheses in `raise` statements if the exception type is raised with no argument. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124261 Approved by: https://github.com/albanD	2024-04-17 19:29:34 +00:00
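A minimal sketch of what the `RSE` rule in 93e249969b changes. The class `BaseBackend` is hypothetical and used only for illustration.

```python
class BaseBackend:
    """Hypothetical base class, used only to illustrate the lint rule."""

    def synchronize(self) -> None:
        # Flagged form: `raise NotImplementedError()`.
        # Python instantiates the exception class itself when no arguments
        # are given, so the empty parentheses can be dropped.
        raise NotImplementedError


try:
    BaseBackend().synchronize()
except NotImplementedError as err:
    print(type(err).__name__)  # prints NotImplementedError
```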
Xuehai Pan	2e48f7b044	[pytree] add `tree_iter` function (#123913 ) - Add a new `tree_iter` function. - Bump `optree` version to `0.11.0` for C++ version of `tree_iter`. This PR is split from #120300. - #120300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123913 Approved by: https://github.com/zou3519	2024-04-16 06:02:08 +00:00
Xuehai Pan	9bb54c7f3c	[pytree] enable `functools.wraps` in Python pytree with dynamo (#124012 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/124012 Approved by: https://github.com/Skylion007	2024-04-14 09:25:05 +00:00
Aaron Gokaslan	1d6c5972c1	[BE]: Optimize min/max/sum comprehensions C419 (#123960 ) Automatic fixes that replaces certain list comprehensions with generator ones where appropriate so that they are immediately consumed. This is preview functionality in ruff for rule C419 and it was automatically applied. Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123960 Approved by: https://github.com/malfet	2024-04-12 23:54:15 +00:00
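The rewrite applied by 1d6c5972c1 can be sketched as follows. The numbers are made up; the point is that `sum`, `min` and `max` accept a generator expression directly, so the intermediate list is unnecessary.

```python
tensor_sizes = [262144, 4194304, 1024]  # made-up element counts
bytes_per_element = 4

# List comprehension: builds a temporary list that sum() then walks once.
total_from_list = sum([n * bytes_per_element for n in tensor_sizes])

# Generator expression: values are produced and consumed one at a time.
total_from_generator = sum(n * bytes_per_element for n in tensor_sizes)

print(total_from_list == total_from_generator)  # prints True
print(max(n for n in tensor_sizes))  # prints 4194304
```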
Edward Z. Yang	efa36ef092	Natively support int truncation, don't guard on positive/negative (#122827 ) This doesn't entirely fix the original problem that prompted this, but it seems to just be getting stuck in export constraint formatting now which seems like progress to me. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122827 Approved by: https://github.com/avikchaudhuri	2024-04-11 15:22:32 +00:00
chilli	84580f76d9	fix flop counter issue with out parameters (#123768 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/123768 Approved by: https://github.com/zou3519	2024-04-11 09:39:53 +00:00
Edward Z. Yang	26a9b05bce	Set stacklevel on checkpoint warning (#123717 ) Partially addresses https://github.com/pytorch/pytorch/issues/123626 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123717 Approved by: https://github.com/Skylion007	2024-04-10 17:25:06 +00:00
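For readers unfamiliar with `stacklevel`, here is a minimal sketch of its effect using only the standard `warnings` module. The function `checkpoint_like`, the warning text and the value `stacklevel=2` are assumptions for illustration; the commit message in 26a9b05bce does not state which value was used.

```python
import warnings


def checkpoint_like(fn, use_reentrant=None):
    """Hypothetical wrapper; not the real torch.utils.checkpoint.checkpoint."""
    if use_reentrant is None:
        # stacklevel=2 attributes the warning to the caller's line
        # rather than to this line inside the library function.
        warnings.warn(
            "use_reentrant should be passed explicitly",
            UserWarning,
            stacklevel=2,
        )
    return fn()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = checkpoint_like(lambda: 42)  # the warning is reported for this line

print(result, len(caught))  # prints 42 1
```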
Matěj Kripner	bd59e1113d	Improve docstring for tensorboard add_embedding() (#120408 ) Fixes missing parameter documentation (`metadata_header`). Fixes a typo. Adds a note explaining a somewhat confusing behavior of Tensorboard Projector where categorical values with more than 50 unique values are not permitted to be used for coloring. This was not documented anywhere. The confusion caused https://github.com/tensorflow/tensorboard/issues/61. Pull Request resolved: https://github.com/pytorch/pytorch/pull/120408 Approved by: https://github.com/albanD	2024-04-10 00:32:29 +00:00
Angela Yi	1be2126ff6	[pytree] Fix namedtuple serialization (#123388 ) Summary: Previously we were serializing namedtuple treespecs incorrectly: ```python Point = namedtuple("Point", ["x", "y"]) p = Point(1, 2) flat, spec = pytree.tree_flatten(p) print(flat) # [1, 2] print(spec) # TreeSpec(type=namedtuple, context=Point, children=[, ]) dumped_spec = pytree.treespec_dumps(spec) print(dumped_spec) """ We only serialize the name of the class and the fields of the namedtuple: TreeSpec { type='collections.namedtuple', context={class_name='Point', class_fields={'x', 'y'}}, children=[Leaf, Leaf] } """ reconstructed_spec = pytree.treespec_loads(dumped_spec) print(reconstructed_spec) """ When we load, we create a new namedtuple class containing the same fields as before, but the class is now a completely different class than the original one: TreeSpec(type=namedtuple, context=torch.utils._pytree.Point, children=[, ]) """ spec == reconstructed_spec # False ``` So, we introduce a new API called `pytree._register_namedtuple` where users can pass in the serialized name for each namedtuple class: ```python Point = namedtuple("Point", ["x", "y"]) pytree._register_namedtuple(Point, "Point") p = Point(1, 2) flat, spec = pytree.tree_flatten(p) print(flat) # [1, 2] print(spec) # TreeSpec(type=namedtuple, context=Point, children=[, ]) dumped_spec = pytree.treespec_dumps(spec) print(dumped_spec) """ TreeSpec { type='collections.namedtuple', context='Point', children=[Leaf, Leaf] } """ reconstructed_spec = pytree.treespec_loads(dumped_spec) print(reconstructed_spec) # TreeSpec(type=namedtuple, context=Point, children=[, ]) spec == reconstructed_spec # True ``` Test Plan: `python test/test_pytree.py` Differential Revision: D55771058 Pull Request resolved: https://github.com/pytorch/pytorch/pull/123388 Approved by: https://github.com/zou3519	2024-04-08 20:55:19 +00:00
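The underlying problem in 1be2126ff6 can be reproduced without PyTorch. The sketch below uses only `collections.namedtuple`; the `REGISTRY` dictionary and `load_class` helper are hypothetical stand-ins for the idea behind `pytree._register_namedtuple`, not its actual implementation.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])

# Rebuilding a namedtuple from only its name and fields yields a new class.
Rebuilt = namedtuple("Point", ["x", "y"])
print(Rebuilt is Point)  # prints False
print(Rebuilt(1, 2) == Point(1, 2))  # prints True: instances still compare as tuples

# Hypothetical registry keyed by a serialized name, so that loading can
# return the original class instead of creating a look-alike.
REGISTRY = {"Point": Point}


def load_class(serialized_name: str):
    """Return the originally registered class instead of rebuilding one."""
    return REGISTRY[serialized_name]


print(load_class("Point") is Point)  # prints True
```

Any comparison that checks class identity fails for the rebuilt class and succeeds for the registered one.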
lezcano	7ce42ebd44	Generalise mod value ranges (#123253 ) We also add the usual comment where we note that we don't handle negative values in mod properly. We should also fix this in the definition of ModularIndexing. I'll do that in a later PR, as for that one I'll also need to fix a number of tests that are testing an incorrect behaviour. Pull Request resolved: https://github.com/pytorch/pytorch/pull/123253 Approved by: https://github.com/peterbell10	2024-04-06 20:19:24 +00:00
xinan.lin	9743e3a19c	[Inductor Intel GPU backend Upstream] Add Inductor Intel GPU backend. (#121895 ) Following the design in RFC https://github.com/pytorch/pytorch/issues/114856, this PR implements the Intel GPU Inductor backend by: - Reusing WrapperCodegen and TritonScheduling for Python wrapper and kernel code generation, and implementing device-specific code generation in XPUDeviceOpOverrides. - Reusing fx_pass, lowering, codecache, Triton kernel auto-tuning, and compilation. For testing, this PR provides test/inductor/test_xpu_basic.py, which covers basic Inductor backend functionality. We'll reuse all the existing Inductor test cases in the next PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121895 Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire	2024-04-05 09:05:11 +00:00
rzou	44c0c0fc0f	Add torch.library.custom_op (#122344 ) This is the entrypoint for defining an opaque/blackbox (e.g. PyTorch will never peek into it) custom op. In this PR, you can specify backend impls and the abstract impl for this op. NB: most of this PR is docstrings, please don't be intimidated by the line count. There are a number of interesting features: - we infer the schema from type hints. In a followup I add the ability to manually specify a schema. - name inference. The user needs to manually specify an op name for now. In a followup we add the ability to automatically infer a name (this is a little tricky). - custom_op registrations can override each other. This makes them more pleasant to work with in environments like colab. - we require that the outputs of the custom_op do not alias any inputs or each other. We enforce this via a runtime check, but can relax this into an opcheck test if it really matters in the future. Test Plan: - new tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/122344 Approved by: https://github.com/ezyang, https://github.com/albanD	2024-04-03 18:36:17 +00:00
Yifu Wang	eb3a34d280	Optimize multi_tensor_apply (take 2) (#119764 ) ### Take 2 The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153: - Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication. - Ensure the optimization is compatible with cuda graph. ### Summary Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops. Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach: - When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments. - Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel. This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`. ### Benchmark (WIP) The only benchmark I've conducted so far is on `_foreach_copy_` on a set of sizes that resembles internal workload. I need to benchmark more problem sizes. The speedup should vary among problem sizes. However, I believe this PR should not be slower than the previous impl on any problem sizes. The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa). 
Baseline A single iteration in trace: <img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json device ms: 1.111, cpu ms: 7.151 memory bandwidth: 1169.825 GB/s ``` This PR A single iteration in trace: <img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json device ms: 0.892, cpu ms: 0.810 memory bandwidth: 1456.744 GB/s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764 Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar	2024-04-03 05:54:49 +00:00
Nikita Shulga	35c493f2cf	[CPP Extension] Escape include paths (#122974 ) By using `shlex.quote` on Linux/Mac and `_nt_quote_args` on Windows. Test it by adding a non-existent path with spaces and a single quote. TODO: Fix double quotes on Windows (will require touching `_nt_quote_args`, so will leave it for another day). Fixes https://github.com/pytorch/pytorch/issues/122476 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122974 Approved by: https://github.com/Skylion007	2024-03-30 21:58:29 +00:00
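The POSIX half of the approach in 35c493f2cf can be sketched with the standard `shlex` module. The paths, the `c++` command and the `ext.cpp` file name are made up, and this is not the code from `torch/utils/cpp_extension.py`; it only shows that quoting lets paths with spaces and single quotes survive a shell command line.

```python
import shlex

# Made-up include directories containing a space and a single quote.
include_dirs = ["/opt/my libs/include", "/tmp/it's here"]

# Quote each path before it is spliced into a shell command line.
flags = [f"-I{shlex.quote(path)}" for path in include_dirs]
command = " ".join(["c++", *flags, "-c", "ext.cpp"])

# Splitting the command the way a POSIX shell would recovers the original paths.
recovered = shlex.split(command)[1:3]
print(recovered)
```

Without `shlex.quote`, the first path would be split into two arguments at the space.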
Yu, Guangye	eb7adc3ae0	Refactor gpu trace to be device-agnostic (#121794 ) # Motivation Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic so that it can be shared among device backends. # Solution Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callbacks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794 Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui	2024-03-30 13:04:38 +00:00
Edward Z. Yang	3178ba0dc9	Don't use sympy Float functions, use an opaque one with no reasoning (#122823 ) Sympy simplifications don't obey floating point semantics, so don't use Sympy for this. Keep them as is, only evaluate with the reference implementations when all arguments are known. This may end up getting subsumed by some other changes later, but I wanted to understand if this was easy and it seems to be easy. This doesn't actually depend on the earlier diffs on the stack and I can detach it. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122823 Approved by: https://github.com/lezcano	2024-03-29 19:13:55 +00:00
fzyzcjy	1e8d4b389b	Super tiny fix typo (#122881 ) "CustoType" -> "CustomType" Pull Request resolved: https://github.com/pytorch/pytorch/pull/122881 Approved by: https://github.com/awgu	2024-03-28 16:13:25 +00:00
PyTorch MergeBot	bef01c7c2b	Revert "Optimize multi_tensor_apply (take 2) (#119764 )" This reverts commit `fe41ba4765`. Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2024105399))	2024-03-27 22:42:07 +00:00
Andrew Gu	1d6fc0d4de	Fixed `_infer_device_type` warning in `checkpoint` (#122726 ) Previously, we were checking `len(device_types)` where `device_types` is a `list`. This meant that if there were multiple inputs, we would see something like `device_types = ["cuda", "cuda"]` and a false positive warning. We should check `len(set(device_types))`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122726 Approved by: https://github.com/soulitzer	2024-03-27 18:38:42 +00:00
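The false positive fixed in 1d6fc0d4de is easy to reproduce in isolation. The two helper functions below are hypothetical and only mirror the check described in the commit message; they are not the actual `_infer_device_type` code.

```python
def has_mixed_devices_buggy(device_types: list) -> bool:
    # Counts inputs, not distinct device types, so two CUDA inputs look "mixed".
    return len(device_types) > 1


def has_mixed_devices_fixed(device_types: list) -> bool:
    # Deduplicate first: only genuinely different device types count.
    return len(set(device_types)) > 1


same_device = ["cuda", "cuda"]
print(has_mixed_devices_buggy(same_device))  # prints True (false positive)
print(has_mixed_devices_fixed(same_device))  # prints False
print(has_mixed_devices_fixed(["cuda", "cpu"]))  # prints True
```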
Yifu Wang	fe41ba4765	Optimize multi_tensor_apply (take 2) (#119764 ) ### Take 2 The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153: - Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication. - Ensure the optimization is compatible with cuda graph. ### Summary Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops. Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach: - When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments. - Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel. This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`. ### Benchmark (WIP) The only benchmark I've conducted so far is on `_foreach_copy_` on a set of sizes that resembles internal workload. I need to benchmark more problem sizes. The speedup should vary among problem sizes. However, I believe this PR should not be slower than the previous impl on any problem sizes. The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa). 
Baseline A single iteration in trace: <img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json device ms: 1.111, cpu ms: 7.151 memory bandwidth: 1169.825 GB/s ``` This PR A single iteration in trace: <img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json device ms: 0.892, cpu ms: 0.810 memory bandwidth: 1456.744 GB/s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764 Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar	2024-03-27 00:51:30 +00:00
Edward Z. Yang	f42818321b	Restore DILL_AVAILABLE for backwards compat with torchdata (#122616 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/122616 Approved by: https://github.com/peterbell10	2024-03-26 02:18:51 +00:00
PyTorch MergeBot	5e0440edb4	Revert "Optimize multi_tensor_apply (take 2) (#119764 )" This reverts commit `0b68a28c87`. Reverted https://github.com/pytorch/pytorch/pull/119764 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing ROCm job in trunk `0b68a28c87`. Please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/119764#issuecomment-2014190124))	2024-03-22 02:18:28 +00:00
PyTorch MergeBot	968c4c4154	Revert "Refactor gpu trace to be device-agnostic (#121794 )" This reverts commit `74deacbf31`. Reverted https://github.com/pytorch/pytorch/pull/121794 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it breaks ROCm jobs in trunk `74deacbf31`, please help take a look and reland the change ([comment](https://github.com/pytorch/pytorch/pull/121794#issuecomment-2013674083))	2024-03-21 20:33:17 +00:00
Yifu Wang	0b68a28c87	Optimize multi_tensor_apply (take 2) (#119764 ) ### Take 2 The first take (#119153) landed but was reverted because it broke cuda graph for `multi_tensor_apply`. This PR is a reland of #119153: - Incorporate #119652 so that the optimization can be applied (1) without increasing binary size (2) to all 3 MTA variants without much code duplication. - Ensure the optimization is compatible with cuda graph. ### Summary Due to the dynamic nature of the workload, the kernel arguments aren't guaranteed to fit in the static 4kb kernel argument memory. Previously with the apex implementation, we overcame this limitation by dividing a multi_tensor_apply workload into multiple kernel launches. However, this led to low sustained occupancy, affecting the performance of memory bound ops. Based on the observation that the kernel argument memory limitation doesn't correlate well with available SM resources, we adopt a different approach: - When the kernel arguments fit into the static kernel argument memory, we use this memory to transfer the arguments. - Conversely, when the kernel arguments don't fit into the static kernel argument memory, instead of sacrificing sustained occupancy, we use a page-locked cudaMemcpyAsync to transfer the arguments, then perform the entire workload in a single kernel. This PR only covers `multi_tensor_apply` for tensors. The change can be easily applied to `multi_tensor_apply` for tensors + scalars and `multi_tensor_apply_for_fused_optimizer`. ### Benchmark (WIP) The only benchmark I've conducted so far is on `_foreach_copy_` on a set of sizes that resembles internal workload. I need to benchmark more problem sizes. The speedup should vary among problem sizes. However, I believe this PR should not be slower than the previous impl on any problem sizes. The benchmark can be reproduced with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa). 
Baseline A single iteration in trace: <img width="831" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/5c8d72d0-0628-4989-88a8-c756f6bc1319"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_5a59145f-567b-472f-8eef-c61c388d45b4.json device ms: 1.111, cpu ms: 7.151 memory bandwidth: 1169.825 GB/s ``` This PR A single iteration in trace: <img width="967" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/a023e183-8166-48f7-b7c0-c8ba32653d2b"> ``` https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_da060725-62a8-466e-b570-2ad67ff0e29d.json device ms: 0.892, cpu ms: 0.810 memory bandwidth: 1456.744 GB/s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/119764 Approved by: https://github.com/eqy, https://github.com/eellison, https://github.com/crcrpar	2024-03-21 11:53:31 +00:00
Yu, Guangye	74deacbf31	Refactor gpu trace to be device-agnostic (#121794 ) # Motivation Refactor gpu trace to be device-agnostic. gpu trace is usually used in runtime components, including Device, Stream, Event, Guard, and Allocator. It should be device-agnostic so that it can be shared across device backends. # Solution Move `_cuda_trace.py` to `_gpu_trace.py`, so that each device backend owns its own callback. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121794 Approved by: https://github.com/jgong5, https://github.com/albanD, https://github.com/EikanWang, https://github.com/gujinghui	2024-03-21 01:52:58 +00:00
Jing Shan	57734202c6	[HSTU][TGIF] Provide a API to check whether running in torch_dispatch mode (#122339 ) Summary: We provide an `is_in_torch_dispatch_mode` API returning `bool` to determine whether the program is running in torch dispatch mode or not. Test Plan: - OSS CI - Tested with publish of hstu models with this diff and the following diffs D54964288, D54964702, D54969677, D55025489; runtime errors are no longer raised in publish Differential Revision: D55091453 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122339 Approved by: https://github.com/jiayisuse	2024-03-21 01:37:23 +00:00
Pritam Damania	512251c8f3	Use tree_map to get device ids and device types for activation checkpointing (#121462 ) `get_device_states` doesn't recursively look into nested lists/dicts to find tensors. As a result, activation checkpointing for such inputs silently produces incorrect results: `get_device_states` returns an empty result, so `fwd_device_states` is empty and no rng state is saved here: https://github.com/pytorch/pytorch/blob/main/torch/utils/checkpoint.py#L188. Fixed this by using `tree_map` for both `get_device_states` and `_infer_device_type`. Also added appropriate unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/121462 Approved by: https://github.com/soulitzer	2024-03-20 21:09:21 +00:00
Jing Shan	d813474363	[Pytorch] auto format _python_dispatch file (#122226 ) Summary: Auto format the _python_dispatch file, to make D55091453 easier to review Test Plan: `arc lint` Differential Revision: D55091454 Pull Request resolved: https://github.com/pytorch/pytorch/pull/122226 Approved by: https://github.com/aakhundov	2024-03-20 19:28:39 +00:00
amoskvic	072935917b	Update cuda_to_hip_mappings.py (#122110 ) Added one datatype mapping (cuda_bf16.h), and a number of cub/hipcub mappings. Note: the missing mappings were discovered when hipifying the Mamba model's (https://github.com/state-spaces/mamba) forward kernel. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122110 Approved by: https://github.com/jithunnair-amd, https://github.com/Skylion007	2024-03-20 17:17:53 +00:00
Oguz Ulgen	c0b2e56c8f	Support triton.language.dtype with torch.compile -- Second Attempt (#122141 ) This PR is the second attempt at supporting `triton.language.dtype`, now instead of putting it on the graph, we put it on the side table since it is a constant. Pull Request resolved: https://github.com/pytorch/pytorch/pull/122141 Approved by: https://github.com/jansel ghstack dependencies: #122140	2024-03-19 19:40:52 +00:00

1836 Commits