pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
PyTorch MergeBot	fd182b90a7	Revert "Add deterministic path for CUDA `cumsum` (#136224 )" This reverts commit `d45b0151e5`. Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2369244135))	2024-09-23 19:57:13 +00:00
Kurt Mohler	d45b0151e5	Add deterministic path for CUDA `cumsum` (#136224 ) Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA. Fixes #89492 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224 Approved by: https://github.com/ezyang, https://github.com/justinchuby	2024-09-20 02:41:56 +00:00
Nikita Shulga	cd5452aace	[CUDA] `is_bf16_supported()` should not crash if there are no GPUs (#132313 ) `False` is the good answer on a system that does not have any CUDA GPUs. - Added regression test to TestTorch. Fixes https://github.com/pytorch/pytorch/issues/132303 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132313 Approved by: https://github.com/eqy, https://github.com/syed-ahmed	2024-08-02 02:50:43 +00:00
Songhao Jia	a141334c88	mitigate wrong tensor.dim_order() (#131366 ) Summary: There are some issues with dim order creation; T194410923 has a detailed illustration. One of the reasons is that the `is_contiguous` function may sometimes give an ambiguous memory format result (some tensors can be both channels_last and contiguous at the same time), and dim order generation relies on the memory format result underneath as a shortcut. To mitigate the issue, we make dim order use the shortcut if and only if the tensor belongs to a single memory format. Otherwise, we still recalculate it. Test Plan: CI Differential Revision: D60056793 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131366 Approved by: https://github.com/ezyang	2024-07-30 21:58:15 +00:00
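The ambiguity described in this commit can be reproduced without torch. The sketch below is a simplified stand-in for the stride checks, not the ATen implementation: the helper names (`matches_format`, `memory_formats`, `can_use_shortcut`) are hypothetical, and it assumes 4-D NCHW sizes and that size-1 dimensions are skipped when comparing strides.

```python
from typing import List, Sequence, Tuple

# Dimension visiting order, innermost (fastest-varying) first, for 4-D NCHW sizes.
FORMATS: Tuple[Tuple[str, Tuple[int, ...]], ...] = (
    ("contiguous", (3, 2, 1, 0)),      # W, H, C, N
    ("channels_last", (1, 3, 2, 0)),   # C, W, H, N
)


def matches_format(sizes: Sequence[int], strides: Sequence[int],
                   order: Sequence[int]) -> bool:
    """Check strides against the dense layout implied by `order`."""
    expected = 1
    for d in order:
        if sizes[d] == 1:
            continue  # the stride of a size-1 dim never affects addressing
        if strides[d] != expected:
            return False
        expected *= sizes[d]
    return True


def memory_formats(sizes: Sequence[int], strides: Sequence[int]) -> List[str]:
    """Return every format whose stride check passes."""
    return [name for name, order in FORMATS
            if matches_format(sizes, strides, order)]


def can_use_shortcut(sizes: Sequence[int], strides: Sequence[int]) -> bool:
    """Take the memory-format shortcut only when exactly one format matches."""
    return len(memory_formats(sizes, strides)) == 1


# A (2, 3, 1, 1) tensor with strides (3, 1, 1, 1) passes both checks,
# so the shortcut would be skipped and the dim order recalculated.
ambiguous = memory_formats((2, 3, 1, 1), (3, 1, 1, 1))
```

With sizes `(2, 3, 4, 5)` only one check passes for either layout, so the shortcut stays available for the common case.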
Aaron Orenstein	5a0068cc69	[BE] mypy: disallow untyped decorators (#131428 ) Untyped decorators strip the types from their decorated function so even if the underlying function is fully typed then callers to it don't get any benefit from type annotations. Step 1 - Enable the error and override in all the offending files. #131429 Pull Request resolved: https://github.com/pytorch/pytorch/pull/131428 Approved by: https://github.com/justinchuby, https://github.com/oulgen	2024-07-23 21:50:55 +00:00
wizzniu	8963623494	Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376 ) This PR re-implements pin memory aiming to get rid of the optional `device` argument and makes all related APIs to be device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, it uses [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) in pin_memory/is_pinned to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other specific backends' implement methods. Note: For new backends who want to implement and use pin memory, just inherit AcceleratorHooksInterface and overwrite the `isPinnedPtr` and `getPinnedMemoryAllocator` methods. Additional context: To avoid BC-breaking, this PR just preserves the `device` arg of related APIs and would throw a deprecation warning if `device` arg is passed. Another PR will be submitted to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`...) not to pass this arg based on this PR. In future, `device` arg will be actually removed. Relates #124908 Relates #14560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376 Approved by: https://github.com/albanD	2024-07-23 01:44:15 +00:00
PyTorch MergeBot	726b9268d2	Revert "Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376 )" This reverts commit `c986aeea2d`. Reverted https://github.com/pytorch/pytorch/pull/126376 on behalf of https://github.com/atalman due to Failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/126376#issuecomment-2237496633))	2024-07-18 20:25:20 +00:00
wizzniu	c986aeea2d	Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376 ) This PR re-implements pin memory aiming to get rid of the optional `device` argument and makes all related APIs to be device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, it uses [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) in pin_memory/is_pinned to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other specific backends' implement methods. Note: For new backends who want to implement and use pin memory, just inherit AcceleratorHooksInterface and overwrite the `isPinnedPtr` and `getPinnedMemoryAllocator` methods. Additional context: To avoid BC-breaking, this PR just preserves the `device` arg of related APIs and would throw a deprecation warning if `device` arg is passed. Another PR will be submitted to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`...) not to pass this arg based on this PR. In future, `device` arg will be actually removed. Relates #124908 Relates #14560 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376 Approved by: https://github.com/albanD	2024-07-18 11:54:14 +00:00
eellison	9ab8d47f9d	Constant folding for dynamic shape node (#129686 ) Extend constant folding for dynamic shape node, only support pointwise op and some restricted ops We support dynamic shapes by limiting constant folding of ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead of constant folding. Taken over from https://github.com/pytorch/pytorch/pull/128937 joint work with @imzhuhl Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686 Approved by: https://github.com/Chillee ghstack dependencies: #130367	2024-07-16 00:17:11 +00:00
PyTorch MergeBot	9df4bc6a0d	Revert "Constant folding for dynamic shape node (#129686 )" This reverts commit `b7d287fbec`. Reverted https://github.com/pytorch/pytorch/pull/129686 on behalf of https://github.com/atalman due to Failing internally. Test: https://github.com/pytorch/ao/blob/main/test/prototype/mx_formats/test_mx_linear.py ([comment](https://github.com/pytorch/pytorch/pull/129686#issuecomment-2228755295))	2024-07-15 15:19:24 +00:00
eellison	b7d287fbec	Constant folding for dynamic shape node (#129686 ) Extend constant folding for dynamic shape node, only support pointwise op and some restricted ops We support dynamic shapes by limiting constant folding of ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead of constant folding. Taken over from https://github.com/pytorch/pytorch/pull/128937 joint work with @imzhuhl Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686 Approved by: https://github.com/Chillee ghstack dependencies: #130367	2024-07-12 03:44:29 +00:00
cyy	cb5e9183c6	[Caffe2] [2/N] Remove Caffe2 from tests (#128911 ) Follows #128675 Pull Request resolved: https://github.com/pytorch/pytorch/pull/128911 Approved by: https://github.com/titaiwangms, https://github.com/r-barnes	2024-06-19 00:05:50 +00:00
Aaron Orenstein	dcfa7702c3	Flip default value for mypy disallow_untyped_defs [1/11] (#127838 ) See #127836 for details. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838 Approved by: https://github.com/oulgen	2024-06-08 18:16:33 +00:00
Aaron Gokaslan	12c4a2c297	[BE]: Apply PLR1736 fixes (unnecessary index lookup) (#127716 ) Applies the PLR1736 preview rule with some more autofixes to cut down on unnecessary accesses. Added a noqa since that test actually testing the dunder method. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127716 Approved by: https://github.com/ezyang	2024-06-03 17:22:13 +00:00
Xuehai Pan	67ef2683d9	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#127689 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings whose messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. Resolves #126888 This PR is split from PR #126898. Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689 Approved by: https://github.com/Skylion007	2024-06-02 12:30:43 +00:00
PyTorch MergeBot	033e733021	Revert "[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 )" This reverts commit `749a132fb0`. Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))	2024-05-31 19:47:24 +00:00
Mikayla Gawarecki	cd06ae0cb8	Relax use_count constraints for swap_tensors when AccumulateGrad holds a reference (#127313 ) ### Before this PR: `torch.utils.swap_tensors(a, b)` required the `use_count` of `a` and `b` to be 1 ```python a = torch.randn(2, 3, requires_grad=True) b = torch.randn(2, 4) out = a * 2 out.sum().backward() # Calling swap_tensors here would fail due to the reference held by AccumulateGrad node, which is not cleaned up after backward # torch.utils.swap_tensors(a, b) del out # Calling swap_tensors here would pass torch.utils.swap_tensors(a, b) ``` ### After this PR: `torch.utils.swap_tensors(a, b)` requires the `use_count` of `a` and `b` to be 1 or 2 IF the second reference is held by `AccumulateGrad` A pre-hook will be registered on the `AccumulateGrad` node so that it will fail if it is called (i.e. if user attempts to backward through the graph). ```python a = torch.randn(2, 3, requires_grad=True) b = torch.randn(2, 4) out = a * 2 out.sum().backward() # Calling swap_tensors here is ok torch.utils.swap_tensors(a, b) # If we ever backward to the AccumulateGrad node it will error that it was poisoned by swap_tensors ``` ### Application to `nn.Module` This issue is especially pertinent in context of `nn.Module` where parameters will have `AccumulateGrad` nodes initialized after forward. Specifically, this is intended to address https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127777866. Previously, this would fail at the `m.cpu()` but we want users to be able to do something like the following, and instead raise an error if the user ever attempts to backward through the poisoned `AccumulateGrad` node ```python import torch import torch.nn as nn m = nn.Linear(3, 5) inp = torch.randn(2, 3) out = m(inp) out.sum().backward() m.cpu() ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/127313 Approved by: https://github.com/soulitzer	2024-05-30 07:06:55 +00:00
Xuehai Pan	749a132fb0	[BE] wrap deprecated function/class with `typing_extensions.deprecated` (#126898 ) Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing. Note that only warnings whose messages contain `[Dd]eprecat(ed\|ion)` are updated in this PR. UPDATE: Use `FutureWarning` instead of `DeprecationWarning`. Resolves #126888 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898 Approved by: https://github.com/albanD	2024-05-29 12:09:27 +00:00
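The fallback path in this commit (passing `category=FutureWarning` to `warnings.warn`) can be sketched with the standard library alone. The function names below are hypothetical. Without an explicit category, `warnings.warn` emits a `UserWarning`; `DeprecationWarning` is hidden by Python's default filters unless triggered from `__main__`, whereas `FutureWarning` is shown to end users by default.

```python
import warnings
from typing import List, Tuple, Type


def new_api(x: int) -> int:
    return x + 1


def old_api(x: int) -> int:
    # Explicit category, as the commit adds wherever it was missing.
    warnings.warn(
        "old_api is deprecated, use new_api instead",
        category=FutureWarning,
        stacklevel=2,  # attribute the warning to the caller, not to old_api
    )
    return new_api(x)


def call_and_capture() -> Tuple[int, List[Type[Warning]]]:
    """Call old_api and return its result plus the warning categories raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = old_api(1)
    return result, [w.category for w in caught]
```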
William Wen	5359af0c7e	[dynamo] wrap GraphModule exceptions in dynamo-wrapped tests (#126341 ) Better approach to https://github.com/pytorch/pytorch/pull/126197 to catch issues like https://github.com/pytorch/pytorch/issues/125568. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126341 Approved by: https://github.com/anijain2305, https://github.com/jansel	2024-05-29 05:18:04 +00:00
Yu, Guangye	c09205a057	Deprecate device-specific GradScaler autocast API (#126527 ) # Motivation ## for `torch.amp.GradScaler`, - `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`. - `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`. So, we intend to deprecate them and strongly recommend that developers use `torch.amp.GradScaler`. ## for `custom_fwd` and `custom_bwd`, this is a good solution to make the custom function run with or without effect even in an autocast-enabled region and can be shared by other backends, like CPU and XPU. So we generalize it to be device-agnostic, put them in `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`. # Additional Context Add UT to cover the deprecation warning. No need for more UTs to cover the functionality of `torch.amp.custom_f/bwd`; the existing UTs that previously covered the functionality of `torch.cuda.amp.custom_f/bwd` can cover them. To facilitate the review, we separate these code changes into two PRs. The first PR covers `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527 Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang	2024-05-25 06:41:34 +00:00
Matthew Hoffman	86ad101370	Enable pickling `torch._C.Generator` (#126271 ) Fixes #71398 Add `__reduce__` and `__setstate__` methods for `torch._C.Generator`. `__reduce__` returns a tuple of 3 values: 1. `torch.Generator` itself. 2. A one-element tuple containing the `torch.device` to create the `Generator` with, since this cannot be changed after the object is created. 3. The state, a three-element tuple: the initial seed, the offset (or `None` if a CPU `Generator`), and the RNG state tensor. `__setstate__` calls `manual_seed`, `set_offset` (if not `None`), and `set_state` on each respective element of the state. Added test demonstrating successful reserialization with cpu and cuda `Generator`s. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126271 Approved by: https://github.com/ezyang	2024-05-22 14:38:47 +00:00
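The `__reduce__`/`__setstate__` shape described in this commit follows Python's standard pickle protocol. The sketch below uses a hypothetical pure-Python `ToyGenerator` as a stand-in, not `torch._C.Generator`; its fields and the bytes-valued state are assumptions for illustration only.

```python
import pickle
from typing import Optional, Tuple


class ToyGenerator:
    """Stand-in with a construction-time device and mutable RNG state."""

    def __init__(self, device: str = "cpu") -> None:
        self.device = device          # fixed once the object is created
        self.seed = 0
        self.offset: Optional[int] = None
        self.state = b""

    def manual_seed(self, seed: int) -> "ToyGenerator":
        self.seed = seed
        return self

    def set_offset(self, offset: int) -> None:
        self.offset = offset

    def set_state(self, state: bytes) -> None:
        self.state = state

    def __reduce__(self) -> Tuple[type, Tuple[str], tuple]:
        # (callable, constructor args, state): pickle calls the callable with
        # the args, then hands the state to __setstate__.
        return (ToyGenerator, (self.device,),
                (self.seed, self.offset, self.state))

    def __setstate__(self, state: tuple) -> None:
        seed, offset, rng_state = state
        self.manual_seed(seed)
        if offset is not None:        # None marks a generator without offset
            self.set_offset(offset)
        self.set_state(rng_state)


def round_trip(gen: ToyGenerator) -> ToyGenerator:
    """Serialize and deserialize a generator through pickle."""
    return pickle.loads(pickle.dumps(gen))
```

Passing the device through the constructor arguments rather than the state mirrors the commit's point that the device cannot be changed after creation.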
Nikita Shulga	a379ed6e98	Fix SobolEngine default dtype handling (#126781 ) - Change default dtype argument to `None` and fetch its value via `torch.get_default_dtype()` call if not defined - Fix bug in first draw handling logic that would ignore dtype in favor of the default one due to type promotion - Add regression tests Fixes https://github.com/pytorch/pytorch/issues/126478 Pull Request resolved: https://github.com/pytorch/pytorch/pull/126781 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-05-22 01:55:48 +00:00
Tarunbir Gambhir	ad67553c5c	Updated test_torch.py to use new OptimizerInfo infrastructure (#125538 ) Fixes #123451 (only addresses test_torch.py cases) This PR solves the specific task to update `test_grad_scaling_autocast` and `test_params_invalidated_with_grads_invalidated_between_unscale_and_step` in `test/test_torch.py` to use the new OptimizerInfo infrastructure. I have combined tests that call `_grad_scaling_autocast_test` into one called `test_grad_scaling_autocast` and used `_get_optim_inputs_including_global_cliquey_kwargs` to avoid hard-coded configurations. ``` $ lintrunner test/test_cuda.py ok No lint issues. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/125538 Approved by: https://github.com/janeyx99	2024-05-18 15:42:45 +00:00
albanD	19a9de114a	Forbid subclassing _TensorBase directly (#125558 ) As per title. This ensures that the methods defined in _tensor.py exist everywhere we assume they do. BC-Breaking: This is bc-breaking as the user cannot subclass this private class anymore. You should replace any use of _TensorBase with Tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/125558 Approved by: https://github.com/ezyang	2024-05-08 20:29:29 +00:00
PyTorch MergeBot	a0e2f62edd	Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809 )" This reverts commit `9e24c263f9`. Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2091751002))	2024-05-02 21:36:18 +00:00
Danial Javady	9e24c263f9	Include support for the scatter gather cuda kernels to allow for comp… (#124809 ) Fixes #121965 This PR aims to add support for complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now as `complex<double>`, for example, will be more complicated. C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing. Please keep the following in mind: 1) I think this is my first time using PyTorch. 2) This is my first contribution to PyTorch. Environment: 3080 & WSL 2. `nvcc` is at 12.4. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809 Approved by: https://github.com/mikaylagawarecki	2024-05-01 23:58:35 +00:00
PyTorch MergeBot	4d410155b2	Revert "Include support for the scatter gather cuda kernels to allow for comp… (#124809 )" This reverts commit `e09f98c705`. Reverted https://github.com/pytorch/pytorch/pull/124809 on behalf of https://github.com/clee2000 due to windows build failure is real, https://github.com/pytorch/pytorch/actions/runs/8910674030/job/24470387612#step:11:11236 is the correct failure line, ignore the statement saying build passed, batch is errorcodes arent propagating again ([comment](https://github.com/pytorch/pytorch/pull/124809#issuecomment-2088680371))	2024-05-01 16:02:02 +00:00
Danial Javady	e09f98c705	Include support for the scatter gather cuda kernels to allow for comp… (#124809 ) Fixes #121965 This PR aims to add support for complex numbers in the scatter/gather related kernels. For brevity, I will only include `complex<float>` for now as `complex<double>`, for example, will be more complicated. C++ unit tests are currently passing alongside tests in `test_scatter_gather_ops.py`. Python test suites also seem to be passing. Please keep the following in mind: 1) I think this is my first time using PyTorch. 2) This is my first contribution to PyTorch. Environment: 3080 & WSL 2. `nvcc` is at 12.4. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124809 Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki	2024-05-01 14:31:31 +00:00
Yifu Wang	91a4740e72	Disable the CUDA fast path for split_with_sizes_copy when capturing (#125052 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/125052 Approved by: https://github.com/awgu, https://github.com/eellison, https://github.com/eqy	2024-04-27 07:59:39 +00:00
Aaron Orenstein	a8574a9719	Fix global flake8 issues (#124771 ) Prior to this `lintrunner --all-files --take FLAKE8` failed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771 Approved by: https://github.com/Skylion007 ghstack dependencies: #124428	2024-04-26 15:35:53 +00:00
PyTorch MergeBot	1ac60484c1	Revert "Fix global flake8 issues (#124771 )" This reverts commit `f01275934b`. Reverted https://github.com/pytorch/pytorch/pull/124771 on behalf of https://github.com/jeanschmidt due to Unfortunately, I needed to revert #123735 and this one depends on it. So please check if there are no merge conflicts or breakages and feel free to merge this PR again ([comment](https://github.com/pytorch/pytorch/pull/124428#issuecomment-2078699836))	2024-04-26 06:15:17 +00:00
Aaron Orenstein	f01275934b	Fix global flake8 issues (#124771 ) Prior to this `lintrunner --all-files --take FLAKE8` failed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/124771 Approved by: https://github.com/Skylion007 ghstack dependencies: #124428	2024-04-25 14:25:00 +00:00
Isuru Fernando	edcd968b51	Add out wrappers to some decompositions (#115437 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/115437 Approved by: https://github.com/lezcano	2024-04-23 06:26:11 +00:00
Aaron Gokaslan	29cc293725	[BE]: FURB142 - Remove set mutations. Use set update (#124551 ) Uses set mutation methods instead of manually reimplementing (update, set_difference etc). Pull Request resolved: https://github.com/pytorch/pytorch/pull/124551 Approved by: https://github.com/ezyang	2024-04-21 14:12:33 +00:00
Aaron Gokaslan	5a1216bb2e	[BE]: Update ruff to 0.4.1 (#124549 ) Update ruff to 0.4.1. This version fixes a lot of false negatives/false positives, is 20-40% faster, and has various other bug fixes. Below is a before and after table showing the execution time of ruff lint and ruff format in milliseconds courtesy of https://astral.sh/blog/ruff-v0.4.0 \| Repository \| Linter (v0.3) \| Linter (v0.4) \| Formatter (v0.3) \| Formatter (v0.4) \| \|----------------------------------------------------\|---------------\|---------------\|------------------\|------------------\| \| [pytorch/pytorch](https://github.com/pytorch/pytorch) \| 328.7 \| 251.8 \| 351.1 \| 274.9 \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549 Approved by: https://github.com/ezyang	2024-04-21 14:06:23 +00:00
Jane Xu	b412b75b42	[optim] add fused_adam/adamw_kernel support for CPU device (#123074 ) On par with `CUDA` implementation. For `autocast` logic, same with `CUDA` + `Fused Adam`: - check inf in `gradscalar.step` - In fused kernel, if there is `inf`, do nothing. If not, unscale the grad ( also write back) and update the param. TestPlan: ``` # extend CUDA only test for CPU fused adagrad python test_optim.py -k test_fused_matches_forloop python test_optim.py -k test_fused_large_tensor python test_torch.py -k test_grad_scaling_autocast_fused # extend fused test python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step python test_optim.py -k test_can_load_older_state_dict # newly added test (follow `6b1f13ea2f/test/test_cuda.py (L1108)`) python test_optim.py -k test_grad_scaling_autocast_fused_optimizers ``` Benchmark: 5.1x on 56 core SPR Parameter-size=1M Nparams=10 [test script](https://gist.github.com/zhuhaozhe/ef9a290ad3f8f4067b3373a3bdaa33e7) ``` numactl -C 0-55 -m 0 python bench_adam.py non-fused 6.0174267292022705 s fused 1.1787631511688232 s ``` Note: Fused kernel accuracy The accuracy failure in CI shows a difference a little higher than the default tolerance ``` 2024-04-02T06:09:16.2213887Z Mismatched elements: 21 / 64 (32.8%) 2024-04-02T06:09:16.2214339Z Greatest absolute difference: 1.5735626220703125e-05 at index (6, 6) (up to 1e-05 allowed) 2024-04-02T06:09:16.2214813Z Greatest relative difference: 1.0073336852656212e-05 at index (4, 1) (up to 1.3e-06 allowed) ``` I have debugged it step by step, and unfortunately we may not be able to make the `fused kernel` exactly the same as the `non fused` one due to compiler optimizations. 
For example, in non-fused impl ``` exp_avg_sq.mul_(beta2).addcmul_(grad, grad.conj(), value=1 - beta2) ``` and in fused impl ``` exp_avg_sq_ptr[d] = scalar_t(beta2) * exp_avg_sq_ptr[d]; // std::cout << "exp_avg_sq " << exp_avg_sq_ptr[d] << std::endl; exp_avg_sq_ptr[d] = exp_avg_sq_ptr[d] + scalar_t(exp_avg_sq_grad_coefficient) * grad_val * grad_val; ``` If I keep `std::cout`, I get exactly the same results in the UT ``` ===============param 0.6796758770942688 0.6796758770942688 ``` But when I comment it out, there is a difference ``` ===============param 0.6796758770942688 0.6796759366989136 ``` So I will make the tolerance a little higher than the default one. Co-authored-by: Jane Xu <janeyx@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/123074 Approved by: https://github.com/jgong5, https://github.com/janeyx99	2024-04-19 11:14:04 +00:00
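The commit message attributes the mismatch only to "compiler optimizations". One plausible mechanism, stated here as an assumption rather than a diagnosis, is that the compiler contracts a multiply and an add into a single fused operation with one rounding step. The sketch below emulates that effect in pure Python with exact rational arithmetic; it uses double precision and hypothetical helper names, so it illustrates the rounding behavior rather than reproducing the kernel.

```python
from fractions import Fraction


def two_roundings(a: float, b: float, c: float) -> float:
    """Multiply, round to double, then add and round again."""
    product = a * b
    return product + c


def one_rounding(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once (emulates a fused multiply-add)."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Inputs chosen so that the low-order bits of a*b are lost by the first rounding.
a = b = 1.0 + 2.0 ** -30
c = -(1.0 + 2.0 ** -29)

separate = two_roundings(a, b, c)   # the 2**-60 term is rounded away
fused = one_rounding(a, b, c)       # the 2**-60 term survives
```

Both results are valid floating-point answers, which is why comparing the two code paths with a slightly wider tolerance is a reasonable response.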
Sam Larsen	6502c888cf	Enable fx graph cache in torch_test.py when using PYTORCH_TEST_WITH_INDUCTOR=1 (#122010 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/122010 Approved by: https://github.com/eellison	2024-03-19 02:17:10 +00:00
Kurt Mohler	13a54ce279	Avoid COW materialization in `at::parallel_for/parallel_reduce` (#120455 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455 Approved by: https://github.com/albanD	2024-03-01 05:05:28 +00:00
PyTorch MergeBot	86ff31c4a0	Revert "Avoid COW materialization in `at::parallel_for/parallel_reduce` (#120455 )" This reverts commit `cabc09a5f2`. Reverted https://github.com/pytorch/pytorch/pull/120455 on behalf of https://github.com/izaitsevfb due to breaks xla jobs ([comment](https://github.com/pytorch/pytorch/pull/120455#issuecomment-1970026100))	2024-02-28 22:30:18 +00:00
Kurt Mohler	cabc09a5f2	Avoid COW materialization in `at::parallel_for/parallel_reduce` (#120455 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/120455 Approved by: https://github.com/albanD	2024-02-28 00:37:33 +00:00
Sergii Dymchenko	bd9db6a9c7	Update to TorchFix 0.4.0 (#119424 ) `torch.library.Library` updated to `torch.library._scoped_library` in files with many tests where it seems obvious to do, otherwise `noqa: TOR901` added - see https://github.com/pytorch/pytorch/pull/118318 for more context. Pull Request resolved: https://github.com/pytorch/pytorch/pull/119424 Approved by: https://github.com/zou3519	2024-02-12 23:30:12 +00:00
Hirochika Matsumoto	02c24b0b5e	Add Python binding `resizable` to class `{Untyped,Typed}Storage` (#119286 ) This PR exposes `resizable` method of `StorageImpl` to Python frontend to make it accessible for users. Fixes #119233 Pull Request resolved: https://github.com/pytorch/pytorch/pull/119286 Approved by: https://github.com/ezyang, https://github.com/mikaylagawarecki	2024-02-07 19:15:55 +00:00
CaoE	113138aa55	add test cases for GradScaler on CPU (#109994 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/109994 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-02-02 21:49:07 +00:00
Yifu Wang	0f7e63620f	CUDA fast path for split_with_sizes_copy.out (#117203 ) ### Motivation In per-parameter sharding FSDP, each rank holds one shard of every parameter. Before a bucket of parameters is used, FSDP performs all-gather to reconstruct the full parameters. The following example demonstrates the process for `world_size=2`, `num_params=3` (`A`, `B`, `C` stand for values in param `A`, `B`, `C`): All-gather output: ``` AAAABBBCCAAAABBBCC ``` After all-gather-copy-out: ``` AAAAAAAA BBBBBB CCCC ``` The performance of all-gather-copy-out is crucial for the viability of per-parameter sharding FSDP. After thorough experiments, we believe that acceptable performance for this op is not achievable via composing existing ATen ops today. We have proven that ideal performance is achievable with a [custom kernel](https://github.com/pytorch/pytorch/pull/115515). This PR aims to incorporate the optimizations into the appropriate ATen ops (as suggested by @albanD). ### all-gather-copy-out via Composing ATen Ops Carrying out the op via composing ATen ops involves a combination of view ops and copy ops. After thorough experiments, we found that the most natural/performant way to express the op is via `split_with_sizes` + `_foreach_copy_`, which works as follows: Reshape all-gather output as (world_size, -1): ``` AAAABBBCC AAAABBBCC ``` `split_with_sizes` + `_foreach_copy_`: ``` AAAA BBB CC AAAA BBB CC ``` However, the performance of this approach is still far below that of the custom kernel. We've identified the following reasons: - The approach requires materializing `O(num_params)` intermediate views, which induces a large amount of CPU overhead when `num_params` is high. - `_foreach_copy_` uses the same block size for all tensors, leading to waste for small tensors and insufficient thread count for large tensors. This means low effective occupancy. - `_foreach_copy_` dispatches multiple kernels for typical problem sizes for all-gather-copy-out. 
This further lowers the effective occupancy. - Due to the nature of the workload, the underlying copies are unaligned. `_foreach_copy_` isn't aggressive enough in exploiting vectorization opportunities in such workloads. ### PR Introduces a CUDA backend for `split_with_sizes_copy.out` that addresses the above inefficiencies. See code for details. ### Benchmarks The benchmarks are conducted on a set of representative problem sizes on an A100. CPU overhead and GPU execution time are measured separately, as reasonable CPU overhead doesn't directly affect e2e throughput. The reported copy bandwidth is calculated with GPU execution time. Compared to the baseline, we observe 3x-10x higher throughput depending on the problem size. We also observe lower CPU overhead across the board. Baseline: ``` num_params=150 world_size=8 mixed=True Param size: 0.059 GB Copy bandwidth: 67.564 GB/s (gpu ms/iter: 0.869, cpu ms/iter 10.460) num_params=54 world_size=8 mixed=True Param size: 1.453 GB Copy bandwidth: 260.373 GB/s (gpu ms/iter: 5.582, cpu ms/iter 0.572) num_params=54 world_size=8 mixed=True Param size: 0.512 GB Copy bandwidth: 239.585 GB/s (gpu ms/iter: 2.135, cpu ms/iter 0.587) num_params=50 world_size=8 mixed=True Param size: 0.200 GB Copy bandwidth: 205.361 GB/s (gpu ms/iter: 0.976, cpu ms/iter 0.534) num_params=3 world_size=8 mixed=True Param size: 0.983 GB Copy bandwidth: 268.397 GB/s (gpu ms/iter: 3.663, cpu ms/iter 0.084) num_params=9 world_size=8 mixed=True Param size: 0.802 GB Copy bandwidth: 265.240 GB/s (gpu ms/iter: 3.024, cpu ms/iter 0.154) num_params=3 world_size=8 mixed=True Param size: 1.573 GB Copy bandwidth: 268.918 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.087) num_params=9 world_size=8 mixed=True Param size: 2.248 GB Copy bandwidth: 268.141 GB/s (gpu ms/iter: 8.384, cpu ms/iter 0.151) num_params=150 world_size=128 mixed=True Param size: 0.064 GB Copy bandwidth: 73.237 GB/s (gpu ms/iter: 0.874, cpu ms/iter 
10.664) num_params=54 world_size=128 mixed=True Param size: 1.458 GB Copy bandwidth: 259.902 GB/s (gpu ms/iter: 5.609, cpu ms/iter 0.584) num_params=54 world_size=128 mixed=True Param size: 0.515 GB Copy bandwidth: 238.703 GB/s (gpu ms/iter: 2.158, cpu ms/iter 0.612) num_params=50 world_size=128 mixed=True Param size: 0.203 GB Copy bandwidth: 205.144 GB/s (gpu ms/iter: 0.987, cpu ms/iter 0.559) num_params=3 world_size=128 mixed=True Param size: 0.983 GB Copy bandwidth: 270.467 GB/s (gpu ms/iter: 3.635, cpu ms/iter 0.073) num_params=9 world_size=128 mixed=True Param size: 0.802 GB Copy bandwidth: 267.700 GB/s (gpu ms/iter: 2.997, cpu ms/iter 0.133) num_params=3 world_size=128 mixed=True Param size: 1.573 GB Copy bandwidth: 268.913 GB/s (gpu ms/iter: 5.849, cpu ms/iter 0.093) num_params=9 world_size=128 mixed=True Param size: 2.248 GB Copy bandwidth: 266.589 GB/s (gpu ms/iter: 8.433, cpu ms/iter 0.207) num_params=150 world_size=1024 mixed=True Param size: 0.202 GB Copy bandwidth: 135.107 GB/s (gpu ms/iter: 1.495, cpu ms/iter 10.904) num_params=54 world_size=1024 mixed=True Param size: 1.524 GB Copy bandwidth: 258.675 GB/s (gpu ms/iter: 5.890, cpu ms/iter 0.996) num_params=54 world_size=1024 mixed=True Param size: 0.575 GB Copy bandwidth: 238.919 GB/s (gpu ms/iter: 2.408, cpu ms/iter 0.765) num_params=50 world_size=1024 mixed=True Param size: 0.246 GB Copy bandwidth: 209.836 GB/s (gpu ms/iter: 1.172, cpu ms/iter 0.611) num_params=3 world_size=1024 mixed=True Param size: 1.007 GB Copy bandwidth: 270.607 GB/s (gpu ms/iter: 3.720, cpu ms/iter 0.100) num_params=9 world_size=1024 mixed=True Param size: 0.818 GB Copy bandwidth: 266.375 GB/s (gpu ms/iter: 3.071, cpu ms/iter 0.176) num_params=3 world_size=1024 mixed=True Param size: 1.611 GB Copy bandwidth: 270.601 GB/s (gpu ms/iter: 5.952, cpu ms/iter 0.099) num_params=9 world_size=1024 mixed=True Param size: 2.248 GB Copy bandwidth: 268.558 GB/s (gpu ms/iter: 8.371, cpu ms/iter 0.207) num_params=150 world_size=8 mixed=False 
Param size: 0.035 GB Copy bandwidth: 43.749 GB/s (gpu ms/iter: 0.797, cpu ms/iter 10.531) num_params=54 world_size=8 mixed=False Param size: 0.961 GB Copy bandwidth: 254.084 GB/s (gpu ms/iter: 3.781, cpu ms/iter 0.752) num_params=54 world_size=8 mixed=False Param size: 0.282 GB Copy bandwidth: 216.792 GB/s (gpu ms/iter: 1.299, cpu ms/iter 0.717) num_params=50 world_size=8 mixed=False Param size: 0.149 GB Copy bandwidth: 188.025 GB/s (gpu ms/iter: 0.793, cpu ms/iter 0.633) num_params=3 world_size=8 mixed=False Param size: 0.655 GB Copy bandwidth: 267.793 GB/s (gpu ms/iter: 2.447, cpu ms/iter 0.107) num_params=9 world_size=8 mixed=False Param size: 0.634 GB Copy bandwidth: 264.232 GB/s (gpu ms/iter: 2.401, cpu ms/iter 0.182) num_params=3 world_size=8 mixed=False Param size: 1.049 GB Copy bandwidth: 268.455 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.089) num_params=9 world_size=8 mixed=False Param size: 1.711 GB Copy bandwidth: 267.633 GB/s (gpu ms/iter: 6.394, cpu ms/iter 0.177) num_params=150 world_size=128 mixed=False Param size: 0.038 GB Copy bandwidth: 46.698 GB/s (gpu ms/iter: 0.807, cpu ms/iter 10.488) num_params=54 world_size=128 mixed=False Param size: 0.963 GB Copy bandwidth: 253.450 GB/s (gpu ms/iter: 3.799, cpu ms/iter 0.655) num_params=54 world_size=128 mixed=False Param size: 0.283 GB Copy bandwidth: 216.857 GB/s (gpu ms/iter: 1.307, cpu ms/iter 0.671) num_params=50 world_size=128 mixed=False Param size: 0.151 GB Copy bandwidth: 189.059 GB/s (gpu ms/iter: 0.799, cpu ms/iter 0.572) num_params=3 world_size=128 mixed=False Param size: 0.655 GB Copy bandwidth: 269.849 GB/s (gpu ms/iter: 2.429, cpu ms/iter 0.078) num_params=9 world_size=128 mixed=False Param size: 0.634 GB Copy bandwidth: 264.501 GB/s (gpu ms/iter: 2.399, cpu ms/iter 0.149) num_params=3 world_size=128 mixed=False Param size: 1.049 GB Copy bandwidth: 268.426 GB/s (gpu ms/iter: 3.906, cpu ms/iter 0.086) num_params=9 world_size=128 mixed=False Param size: 1.711 GB Copy bandwidth: 267.495 GB/s (gpu 
ms/iter: 6.398, cpu ms/iter 0.170) num_params=150 world_size=1024 mixed=False Param size: 0.122 GB Copy bandwidth: 101.151 GB/s (gpu ms/iter: 1.211, cpu ms/iter 10.476) num_params=54 world_size=1024 mixed=False Param size: 1.000 GB Copy bandwidth: 252.323 GB/s (gpu ms/iter: 3.963, cpu ms/iter 0.633) num_params=54 world_size=1024 mixed=False Param size: 0.318 GB Copy bandwidth: 218.322 GB/s (gpu ms/iter: 1.455, cpu ms/iter 0.622) num_params=50 world_size=1024 mixed=False Param size: 0.185 GB Copy bandwidth: 196.369 GB/s (gpu ms/iter: 0.944, cpu ms/iter 0.576) num_params=3 world_size=1024 mixed=False Param size: 0.671 GB Copy bandwidth: 269.369 GB/s (gpu ms/iter: 2.491, cpu ms/iter 0.076) num_params=9 world_size=1024 mixed=False Param size: 0.645 GB Copy bandwidth: 264.441 GB/s (gpu ms/iter: 2.439, cpu ms/iter 0.140) num_params=3 world_size=1024 mixed=False Param size: 1.074 GB Copy bandwidth: 269.955 GB/s (gpu ms/iter: 3.978, cpu ms/iter 0.073) num_params=9 world_size=1024 mixed=False Param size: 1.711 GB Copy bandwidth: 267.168 GB/s (gpu ms/iter: 6.405, cpu ms/iter 0.147) ``` New kernel: ``` num_params=150 world_size=8 mixed=True Param size: 0.059 GB Copy bandwidth: 560.946 GB/s (gpu ms/iter: 0.105, cpu ms/iter 1.066) num_params=54 world_size=8 mixed=True Param size: 1.453 GB Copy bandwidth: 732.657 GB/s (gpu ms/iter: 1.984, cpu ms/iter 0.417) num_params=54 world_size=8 mixed=True Param size: 0.512 GB Copy bandwidth: 753.514 GB/s (gpu ms/iter: 0.679, cpu ms/iter 0.419) num_params=50 world_size=8 mixed=True Param size: 0.200 GB Copy bandwidth: 719.400 GB/s (gpu ms/iter: 0.279, cpu ms/iter 0.410) num_params=3 world_size=8 mixed=True Param size: 0.983 GB Copy bandwidth: 782.121 GB/s (gpu ms/iter: 1.257, cpu ms/iter 0.098) num_params=9 world_size=8 mixed=True Param size: 0.802 GB Copy bandwidth: 766.458 GB/s (gpu ms/iter: 1.047, cpu ms/iter 0.134) num_params=3 world_size=8 mixed=True Param size: 1.573 GB Copy bandwidth: 790.611 GB/s (gpu ms/iter: 1.989, cpu ms/iter 
0.099) num_params=9 world_size=8 mixed=True Param size: 2.248 GB Copy bandwidth: 789.754 GB/s (gpu ms/iter: 2.847, cpu ms/iter 0.138) num_params=150 world_size=128 mixed=True Param size: 0.064 GB Copy bandwidth: 565.667 GB/s (gpu ms/iter: 0.113, cpu ms/iter 0.996) num_params=54 world_size=128 mixed=True Param size: 1.458 GB Copy bandwidth: 670.681 GB/s (gpu ms/iter: 2.174, cpu ms/iter 0.289) num_params=54 world_size=128 mixed=True Param size: 0.515 GB Copy bandwidth: 676.135 GB/s (gpu ms/iter: 0.762, cpu ms/iter 0.264) num_params=50 world_size=128 mixed=True Param size: 0.203 GB Copy bandwidth: 662.603 GB/s (gpu ms/iter: 0.306, cpu ms/iter 0.249) num_params=3 world_size=128 mixed=True Param size: 0.983 GB Copy bandwidth: 769.283 GB/s (gpu ms/iter: 1.278, cpu ms/iter 0.078) num_params=9 world_size=128 mixed=True Param size: 0.802 GB Copy bandwidth: 761.057 GB/s (gpu ms/iter: 1.054, cpu ms/iter 0.104) num_params=3 world_size=128 mixed=True Param size: 1.573 GB Copy bandwidth: 774.325 GB/s (gpu ms/iter: 2.031, cpu ms/iter 0.075) num_params=9 world_size=128 mixed=True Param size: 2.248 GB Copy bandwidth: 773.048 GB/s (gpu ms/iter: 2.908, cpu ms/iter 0.099) num_params=150 world_size=1024 mixed=True Param size: 0.202 GB Copy bandwidth: 641.405 GB/s (gpu ms/iter: 0.315, cpu ms/iter 0.616) num_params=54 world_size=1024 mixed=True Param size: 1.524 GB Copy bandwidth: 646.772 GB/s (gpu ms/iter: 2.356, cpu ms/iter 0.276) num_params=54 world_size=1024 mixed=True Param size: 0.575 GB Copy bandwidth: 658.157 GB/s (gpu ms/iter: 0.874, cpu ms/iter 0.278) num_params=50 world_size=1024 mixed=True Param size: 0.246 GB Copy bandwidth: 642.032 GB/s (gpu ms/iter: 0.383, cpu ms/iter 0.245) num_params=3 world_size=1024 mixed=True Param size: 1.007 GB Copy bandwidth: 728.990 GB/s (gpu ms/iter: 1.381, cpu ms/iter 0.080) num_params=9 world_size=1024 mixed=True Param size: 0.818 GB Copy bandwidth: 689.763 GB/s (gpu ms/iter: 1.186, cpu ms/iter 0.102) num_params=3 world_size=1024 mixed=True 
Param size: 1.611 GB Copy bandwidth: 765.507 GB/s (gpu ms/iter: 2.104, cpu ms/iter 0.078) num_params=9 world_size=1024 mixed=True Param size: 2.248 GB Copy bandwidth: 757.626 GB/s (gpu ms/iter: 2.967, cpu ms/iter 0.106) num_params=150 world_size=8 mixed=False Param size: 0.035 GB Copy bandwidth: 584.272 GB/s (gpu ms/iter: 0.060, cpu ms/iter 0.656) num_params=54 world_size=8 mixed=False Param size: 0.961 GB Copy bandwidth: 728.234 GB/s (gpu ms/iter: 1.319, cpu ms/iter 0.264) num_params=54 world_size=8 mixed=False Param size: 0.282 GB Copy bandwidth: 730.059 GB/s (gpu ms/iter: 0.386, cpu ms/iter 0.279) num_params=50 world_size=8 mixed=False Param size: 0.149 GB Copy bandwidth: 670.899 GB/s (gpu ms/iter: 0.222, cpu ms/iter 0.274) num_params=3 world_size=8 mixed=False Param size: 0.655 GB Copy bandwidth: 775.699 GB/s (gpu ms/iter: 0.845, cpu ms/iter 0.077) num_params=9 world_size=8 mixed=False Param size: 0.634 GB Copy bandwidth: 773.612 GB/s (gpu ms/iter: 0.820, cpu ms/iter 0.112) num_params=3 world_size=8 mixed=False Param size: 1.049 GB Copy bandwidth: 781.395 GB/s (gpu ms/iter: 1.342, cpu ms/iter 0.081) num_params=9 world_size=8 mixed=False Param size: 1.711 GB Copy bandwidth: 789.156 GB/s (gpu ms/iter: 2.169, cpu ms/iter 0.116) num_params=150 world_size=128 mixed=False Param size: 0.038 GB Copy bandwidth: 517.056 GB/s (gpu ms/iter: 0.073, cpu ms/iter 0.632) num_params=54 world_size=128 mixed=False Param size: 0.963 GB Copy bandwidth: 684.246 GB/s (gpu ms/iter: 1.407, cpu ms/iter 0.294) num_params=54 world_size=128 mixed=False Param size: 0.283 GB Copy bandwidth: 680.593 GB/s (gpu ms/iter: 0.416, cpu ms/iter 0.286) num_params=50 world_size=128 mixed=False Param size: 0.151 GB Copy bandwidth: 682.197 GB/s (gpu ms/iter: 0.221, cpu ms/iter 0.255) num_params=3 world_size=128 mixed=False Param size: 0.655 GB Copy bandwidth: 759.470 GB/s (gpu ms/iter: 0.863, cpu ms/iter 0.074) num_params=9 world_size=128 mixed=False Param size: 0.634 GB Copy bandwidth: 765.694 GB/s (gpu 
ms/iter: 0.829, cpu ms/iter 0.094) num_params=3 world_size=128 mixed=False Param size: 1.049 GB Copy bandwidth: 766.535 GB/s (gpu ms/iter: 1.368, cpu ms/iter 0.075) num_params=9 world_size=128 mixed=False Param size: 1.711 GB Copy bandwidth: 787.608 GB/s (gpu ms/iter: 2.173, cpu ms/iter 0.105) num_params=150 world_size=1024 mixed=False Param size: 0.122 GB Copy bandwidth: 640.203 GB/s (gpu ms/iter: 0.191, cpu ms/iter 0.668) num_params=54 world_size=1024 mixed=False Param size: 1.000 GB Copy bandwidth: 713.947 GB/s (gpu ms/iter: 1.401, cpu ms/iter 0.274) num_params=54 world_size=1024 mixed=False Param size: 0.318 GB Copy bandwidth: 642.855 GB/s (gpu ms/iter: 0.494, cpu ms/iter 0.276) num_params=50 world_size=1024 mixed=False Param size: 0.185 GB Copy bandwidth: 643.297 GB/s (gpu ms/iter: 0.288, cpu ms/iter 0.262) num_params=3 world_size=1024 mixed=False Param size: 0.671 GB Copy bandwidth: 690.626 GB/s (gpu ms/iter: 0.972, cpu ms/iter 0.078) num_params=9 world_size=1024 mixed=False Param size: 0.645 GB Copy bandwidth: 754.431 GB/s (gpu ms/iter: 0.855, cpu ms/iter 0.109) num_params=3 world_size=1024 mixed=False Param size: 1.074 GB Copy bandwidth: 769.985 GB/s (gpu ms/iter: 1.395, cpu ms/iter 0.080) num_params=9 world_size=1024 mixed=False Param size: 1.711 GB Copy bandwidth: 766.337 GB/s (gpu ms/iter: 2.233, cpu ms/iter 0.103) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/117203 Approved by: https://github.com/albanD, https://github.com/awgu ghstack dependencies: #118512	2024-02-01 18:23:01 +00:00
CaoE	bacbad5bc9	add GradScaler on CPU (#109993 ) Step 2 of https://github.com/pytorch/pytorch/issues/111559. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109993 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-01-29 23:42:35 +00:00
Edward Z. Yang	46712b019d	Enable local_partial_types (#118467 ) When using dmypy, this setting is enabled and cannot be turned off. Force it for regular mypy too. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/118467 Approved by: https://github.com/Skylion007 ghstack dependencies: #118414, #118418, #118432	2024-01-28 13:38:22 +00:00
Mikayla Gawarecki	41a56f7828	Fix swap_tensors to swap PyObjects associated with TensorImpl (#116955 ) This PR intends to fix the following issue when swapping two tensors

```python
>>> import torch
>>> torch.manual_seed(5)
>>> t1 = torch.randn(2)
>>> t2 = torch.randn(3)
>>> t1
tensor([-0.4868, -0.6038])
>>> t2
tensor([-0.5581, 0.6675, -0.1974])
>>> torch.utils.swap_tensors(t1, t2)
>>> t1
tensor([-0.5581, 0.6675, -0.1974])
>>> t2
tensor([-0.4868, -0.6038])
>>> t1.fill_(0.5) # t1 back to its unswapped state :o
tensor([-0.4868, -0.6038])
```

What happens here is that in `THPVariable_Wrap` (which is used when going back from C++ --> Python), we check if the TensorImpl of the tensor to be returned already has a pointer to a PyObject in its PyObject slot. If this is the case then this object is returned. `57491d2046/torch/csrc/autograd/python_variable.cpp (L271-L292)` When we run any operation that returns the same TensorImpl (e.g. inplace op, `t.to(dtype=t.dtype)`, etc.), although `t1` now has `t2`'s TensorImpl, `t2`'s TensorImpl still has a reference to `t2`, so when we do the op on `t1` and `THPVariable_Wrap` attempts to return the pointer to the TensorImpl's PyObject, we return a pointer to `t2` instead. The TensorImpls should have the PyObjects in their PyObjectSlots swapped as well in `swap_tensors`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116955 Approved by: https://github.com/albanD	2024-01-24 01:40:18 +00:00
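The stale back-pointer described in this commit message can be reproduced in a toy model. The sketch below is pure Python and does not use PyTorch. `Impl`, `Wrapper`, `swap_buggy` and `swap_fixed` are hypothetical names invented for illustration. They only mimic the described relationship between a TensorImpl, its PyObject slot, and the wrap step, and are not the actual implementation.

```python
class Impl:
    """Toy stand-in for a TensorImpl: data plus a back-pointer to its Python wrapper."""

    def __init__(self, data):
        self.data = data
        self.pyobj = None  # analogue of the PyObject slot


class Wrapper:
    """Toy stand-in for the Python-level tensor object."""

    def __init__(self, data):
        self.impl = Impl(data)
        self.impl.pyobj = self

    def fill_(self, value):
        # In-place op: mutate the impl, then "wrap" the result by reusing
        # whichever wrapper the impl has recorded in its slot.
        self.impl.data = [value] * len(self.impl.data)
        return self.impl.pyobj


def swap_buggy(a, b):
    # Swaps the impls but leaves each impl pointing at its old wrapper.
    a.impl, b.impl = b.impl, a.impl


def swap_fixed(a, b):
    # Swaps the impls and re-points each impl's slot at its new wrapper.
    a.impl, b.impl = b.impl, a.impl
    a.impl.pyobj, b.impl.pyobj = a, b


t1, t2 = Wrapper([1.0, 2.0]), Wrapper([3.0, 4.0, 5.0])
swap_buggy(t1, t2)
buggy_result = t1.fill_(0.5)  # comes back as t2, the stale wrapper

u1, u2 = Wrapper([1.0, 2.0]), Wrapper([3.0, 4.0, 5.0])
swap_fixed(u1, u2)
fixed_result = u1.fill_(0.5)  # comes back as u1, as expected
```

In the buggy variant the in-place op hands back the other wrapper, which matches the surprising `t1.fill_(0.5)` output in the transcript. Swapping the slot contents along with the impls removes the mismatch.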
Kurt Mohler	cd084c4909	Add `TensorIteratorConfig::add_const_input` to avoid COW materialize (#118053 ) Part of #97856 Pull Request resolved: https://github.com/pytorch/pytorch/pull/118053 Approved by: https://github.com/ezyang	2024-01-23 22:32:39 +00:00
Oguz Ulgen	3b38f7b266	Remove skips for passing tests (#118000 ) These tests were already passing Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000 Approved by: https://github.com/yanboliang	2024-01-23 16:11:38 +00:00
PyTorch MergeBot	bb28965924	Revert "Remove skips for passing tests (#118000 )" This reverts commit `3c339b5b21`. Reverted https://github.com/pytorch/pytorch/pull/118000 on behalf of https://github.com/oulgen due to test passing on diff but failing on hud... ([comment](https://github.com/pytorch/pytorch/pull/118000#issuecomment-1905351752))	2024-01-23 06:10:25 +00:00
Oguz Ulgen	3c339b5b21	Remove skips for passing tests (#118000 ) These tests were already passing Pull Request resolved: https://github.com/pytorch/pytorch/pull/118000 Approved by: https://github.com/yanboliang	2024-01-23 03:41:23 +00:00
haozhe.zhu@intel.com	0ae952db76	enable mkldnn bf32 matmul (#116015 ) ### Testing FP32 matmul vs. mkldnn BF32 matmul on SPR

single core:

Input | BF32 / ms | FP32 / ms | Speed up
-- | -- | -- | --
M: 128, N: 128, K: 128, trans_a: False, trans_b: False | 32.842 | 38.279 | 1.165
M: 128, N: 256, K: 128, trans_a: False, trans_b: False | 38.590 | 73.967 | 1.917
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 18456.267 | 74588.002 | 4.041

56 cores:

Input | BF32 / ms | FP32 / ms | Speed up
-- | -- | -- | --
M: 8192, N: 768, K: 768, trans_a: False, trans_b: False | 1199.400 | 1715.548 | 1.430
M: 8192, N: 768, K: 768, trans_a: False, trans_b: True | 1129.204 | 1708.912 | 1.513
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: False | 3655.915 | 7992.877 | 2.186
M: 8192, N: 768, K: 3072, trans_a: False, trans_b: True | 3707.993 | 8026.191 | 2.165
Batch: 768, M: 128, N: 64, K: 128 | 1296.419 | 1308.411 | 1.009

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116015 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-01-20 09:30:23 +00:00
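The "Speed up" column above appears to be the FP32 time divided by the BF32 time. That reading is an inference from the numbers, not something the commit message states. A short check over a few of the reported rows, with row labels abbreviated for illustration:

```python
# (label, BF32 ms, FP32 ms, reported speed-up), copied from the tables above.
rows = [
    ("single core, M=128 N=128 K=128", 32.842, 38.279, 1.165),
    ("single core, M=128 N=256 K=128", 38.590, 73.967, 1.917),
    ("single core, M=8192 N=768 K=768", 18456.267, 74588.002, 4.041),
    ("56 cores, M=8192 N=768 K=3072", 3655.915, 7992.877, 2.186),
]


def speedup(bf32_ms, fp32_ms):
    """Ratio of baseline (FP32) time to BF32 time; above 1 means BF32 is faster."""
    return fp32_ms / bf32_ms


# Largest gap between the recomputed ratio and the reported column.
max_gap = max(abs(speedup(bf32, fp32) - reported) for _, bf32, fp32, reported in rows)

for label, bf32, fp32, reported in rows:
    print(f"{label}: computed {speedup(bf32, fp32):.3f}, reported {reported:.3f}")
```

The recomputed ratios agree with the reported column to within about a thousandth. The small differences are consistent with the reported values having been derived from unrounded timings.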
CaoE	29516bd2a0	add _amp_foreach_non_finite_check_and_unscale_cpu_ and _amp_update_scale_cpu_ kernels on CPU (#109281 ) Step1 of https://github.com/pytorch/pytorch/issues/111559. Pull Request resolved: https://github.com/pytorch/pytorch/pull/109281 Approved by: https://github.com/jgong5, https://github.com/ezyang	2024-01-16 15:25:08 +00:00
Edward Z. Yang	2200118f59	Enable some uint{16,32,64} tests that are working (#116809 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116809 Approved by: https://github.com/albanD	2024-01-15 02:25:21 +00:00
Edward Z. Yang	edec54b9de	Add `torch._lazy_clone` to create COW tensors (#113397 ) Part of #109833 Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom): * __->__ #113397 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397 Approved by: https://github.com/ezyang	2024-01-11 01:32:44 +00:00
Edward Z. Yang	8bcdde5058	Support uint{16,32,64} deterministic empty fill and scalar Python binding handling (#116807 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116807 Approved by: https://github.com/albanD ghstack dependencies: #116805, #116806	2024-01-10 02:17:23 +00:00
Edward Z. Yang	43a23a704a	Support uint{16,32,64} copy (#116806 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116806 Approved by: https://github.com/albanD ghstack dependencies: #116805	2024-01-10 02:17:23 +00:00
Edward Z. Yang	2e983fcfd3	Support unsigned int for randint, item, equality, fill, iinfo, tensor (#116805 ) These are some basic utilities that are often used for testing. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116805 Approved by: https://github.com/albanD	2024-01-10 02:17:23 +00:00
Aaron Gokaslan	3fe437b24b	[BE]: Update flake8 to v6.1.0 and fix lints (#116591 ) Updates flake8 to v6.1.0 and fixes a few lints using sed and some ruff tooling. - Replace `assert(0)` with `raise AssertionError()` - Remove extraneous parenthesis i.e. - `assert(a == b)` -> `assert a == b` - `if(x > y or y < z):`->`if x > y or y < z:` - And `return('...')` -> `return '...'` Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/116591 Approved by: https://github.com/albanD, https://github.com/malfet	2024-01-03 06:04:44 +00:00
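The rewrites listed in this commit can be seen side by side in a small sketch. The function names are hypothetical and exist only for illustration. One likely motivation for the first rewrite, which the commit message does not state, is a general Python fact: `assert` statements are removed when Python runs with `-O`, whereas an explicit `raise AssertionError()` always executes.

```python
def check_before(a, b, x, y, z):
    # Style flagged by the lints: call-like parentheses around statements.
    assert(a == b)
    if(x > y or y < z):
        return('hit')
    return('miss')


def check_after(a, b, x, y, z):
    # Same behavior with the extraneous parentheses removed.
    assert a == b
    if x > y or y < z:
        return 'hit'
    return 'miss'


def unreachable():
    # Replacement for `assert(0)`: raises even when asserts are stripped by -O.
    raise AssertionError()


def raises_assertion_error(fn):
    """Return True if calling fn raises AssertionError."""
    try:
        fn()
    except AssertionError:
        return True
    return False
```

Both `check_*` variants return the same values for the same arguments, so the rewrite is purely stylistic.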
Aaron Gokaslan	bd10fea79a	[BE]: Enable F821 and fix bugs (#116579 ) Fixes #112371 I tried to fix as many of the bugs as I could, a few I could not figure out what the proper fix for them was though and so I left them with noqas. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116579 Approved by: https://github.com/ezyang	2024-01-01 08:40:46 +00:00
Yanbo Liang	f657b2b1f8	[Dynamo][10/N] Remove TorchVariable and is_allowed (#116312 ) After this refactor: * The `TorchVariable` definition and all references are removed. * All `is_allowed` references except one are removed. - The only remaining one is in `torch/_dynamo/decorators:_disallow_in_graph_helper`. It was called when users put the `disallow_in_graph` decorator on a function. Since we use the lists in `trace_rules` to decide a function's trace rule, the decorator would only be used on custom functions rather than torch functions. I'll defer this to a separate decorator refactor PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312 Approved by: https://github.com/jansel	2023-12-27 18:47:05 +00:00
PyTorch MergeBot	3b709d7c1e	Revert "[Dynamo][10/N] Remove TorchVariable and is_allowed (#116312 )" This reverts commit `015bd0e0a1`. Reverted https://github.com/pytorch/pytorch/pull/116312 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/116312#issuecomment-1869825506))	2023-12-26 23:47:15 +00:00
Yanbo Liang	015bd0e0a1	[Dynamo][10/N] Remove TorchVariable and is_allowed (#116312 ) After this refactor: * The `TorchVariable` definition and all references are removed. * All `is_allowed` references except one are removed. - The only remaining one is in `torch/_dynamo/decorators:_disallow_in_graph_helper`. It was called when users put the `disallow_in_graph` decorator on a function. Since we use the lists in `trace_rules` to decide a function's trace rule, the decorator would only be used on custom functions rather than torch functions. I'll defer this to a separate decorator refactor PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/116312 Approved by: https://github.com/jansel	2023-12-23 09:44:09 +00:00
Mikayla Gawarecki	f206e31e2f	Swap slots if slots match in swap_tensor (#116128 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/116128 Approved by: https://github.com/albanD	2023-12-21 00:43:30 +00:00
Kurt Mohler	8a8d0adc0b	Fix `torch.gradient` check for spacing arg list length (#115686 ) Fixes #114207 Pull Request resolved: https://github.com/pytorch/pytorch/pull/115686 Approved by: https://github.com/albanD	2023-12-13 20:17:20 +00:00
mantaionut	d521857411	Terminate handler (#101332 ) Fixes #50051. This PR is based on #50320 and I address the last feedback. On Windows it is enabled by default. Can be enabled or disabled via USE_CUSTOM_TERMINATE env variable. This PR adds support for overriding the terminate handler in order to log uncaught exceptions in the threads. If an exception is thrown and not caught, it will print <Unhandled exception caught in c10/util/AbortHandler.h> The point of doing this is that in issue #50051, exceptions were thrown but not logged. With this logging system it will be easier to debug it in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/101332 Approved by: https://github.com/albanD, https://github.com/malfet	2023-12-12 17:55:27 +00:00
ecao	65651d970b	Optimize the copy of Half to Float and Float to Half on CPU (#103148 ) ### Description Optimize the copy of Half to Float and Float to Half on CPU. ### Testing

Single core:

shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (1, 777) | 0.00345 | 0.00344 | 0.00411 | 0.00410
size: (2, 512) | 0.00355 | 0.00344 | 0.00431 | 0.00400
size: (10, 555) | 0.00473 | 0.00391 | 0.00562 | 0.00477
size: (1, 2048, 1024) | 0.488 | 0.480 | 0.498 | 0.499
size: (32, 100, 777) | 0.584 | 0.568 | 0.571 | 0.587

28 cores:

shape | fp16 -> fp32 / ms | fp32 -> fp16 / ms | bf16 -> fp32 / ms | fp32 -> bf16 / ms
-- | -- | -- | -- | --
size: (10, 555) | 0.00472 | 0.00369 | 0.00576 | 0.00481
size: (1, 2048, 1024) | 0.0189 | 0.0188 | 0.0173 | 0.0251
size: (64, 512, 1024) | 3.159 | 2.375 | 3.152 | 2.358
size: (32, 100, 777) | 0.0225 | 0.0195 | 0.0193 | 0.0261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103148 Approved by: https://github.com/jgong5, https://github.com/cpuhrsch	2023-12-12 05:57:52 +00:00
FFFrog	3361496f96	Fix the corner case of index_add (#114929 ) Fixes #114864 As the title stated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/114929 Approved by: https://github.com/mikaylagawarecki	2023-12-09 01:57:25 +00:00
albanD	a2b89154bf	New swap function (#111747 ) This PR proposes a new approach to solve the problem of nn/optim being linked only by Python object identity. The idea is to have a function that can swap the content of two Tensors t1 and t2 while preserving all the old references. This would allow us to swap the `model.weight` with a new Tensor (can be any subclass of Tensor and any TensorImpl (xla, sparse, nested tensorimpl would work)). The use within nn will be done in a follow up. This is done by swapping the whole content of the PyObject and then putting back the fields associated with external references (refcount, gc tracking and weakrefs). Note that we have to properly handle all the cases where there is memory used before the public pointer PyObject* and where the PyObject is bigger due to dict/weakref being inlined (older CPython version) or due to slots. The main limitation of this approach is that the number of slots needs to match for the objects being swapped, which limits usage of slots in subclasses. Draft right now to see what @colesbury thinks about doing this? Pull Request resolved: https://github.com/pytorch/pytorch/pull/111747 Approved by: https://github.com/colesbury	2023-12-08 18:49:35 +00:00
Kurt Mohler	6f32eb7eef	Add decomp for `replication_pad2d` and use for CUDA deterministic (#111590 ) Fixes #95578 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590 Approved by: https://github.com/peterbell10	2023-12-01 18:56:09 +00:00
PyTorch MergeBot	013675ff59	Revert "Add decomp for `replication_pad2d` and use for CUDA deterministic (#111590 )" This reverts commit `f1286161a6`. Reverted https://github.com/pytorch/pytorch/pull/111590 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing XLA job. The job is also failing on the PR, but the log classifier failed to find the failed test which lead to it being marked wrongly as flaky ([comment](https://github.com/pytorch/pytorch/pull/111590#issuecomment-1833004794))	2023-11-30 02:28:14 +00:00
Kurt Mohler	f1286161a6	Add decomp for `replication_pad2d` and use for CUDA deterministic (#111590 ) Fixes #95578 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111590 Approved by: https://github.com/peterbell10	2023-11-29 21:50:46 +00:00
PyTorch MergeBot	fe428a284b	Revert "Add `torch._lazy_clone` to create COW tensors (#113397 )" This reverts commit `9916d8a9ea`. Reverted https://github.com/pytorch/pytorch/pull/113397 on behalf of https://github.com/DanilBaibak due to Unfortunately, I need to revert your PR because the lower [PR in the stack](https://github.com/pytorch/pytorch/pull/113396) is failing a bunch of internal build jobs. ([comment](https://github.com/pytorch/pytorch/pull/113397#issuecomment-1818761224))	2023-11-20 10:21:09 +00:00
PyTorch MergeBot	d40d72d664	Revert "Skip test_lazy_clone for Inductor (#114012 )" This reverts commit `ecd8d388b9`. Reverted https://github.com/pytorch/pytorch/pull/114012 on behalf of https://github.com/DanilBaibak due to I revert the PR due to the original changes broke the internal build. Here is the original diff stack [D51444337](https://www.internalfb.com/diff/D51444337) ([comment](https://github.com/pytorch/pytorch/pull/114012#issuecomment-1818745425))	2023-11-20 10:12:44 +00:00
Nikita Shulga	ecd8d388b9	Skip test_lazy_clone for Inductor (#114012 ) As half of those tests fail if run individually, but first failure masks all subsequent ones, i.e.

```
PYTORCH_TEST_WITH_INDUCTOR=1 python3 test/test_torch.py -v -k test_lazy_clone_cuda_float32
test_lazy_clone_cuda_float32 (__main__.TestTorchDeviceTypeCUDA) ... FAIL
...
self.assertTrue(torch._C._is_cow_tensor(t))
AssertionError: False is not true
----------------------------------------------------------------------
Ran 1 test in 19.419s
FAILED (failures=1)
```

But

```
$ PYTORCH_TEST_WITH_INDUCTOR=1 python3 test/test_torch.py -k test_lazy_clone_
...
......................
----------------------------------------------------------------------
Ran 24 tests in 24.969s
OK
```

This flaky behavior was already detected, for example see https://github.com/pytorch/pytorch/issues/113953 Pull Request resolved: https://github.com/pytorch/pytorch/pull/114012 Approved by: https://github.com/huydhn, https://github.com/kit1980	2023-11-18 04:57:00 +00:00
Kurt Mohler	9916d8a9ea	Add `torch._lazy_clone` to create COW tensors (#113397 ) Part of #109833 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113397 Approved by: https://github.com/ezyang ghstack dependencies: #113396	2023-11-17 01:58:51 +00:00
Brian Hirsh	cebad9867b	graph break on intermediate leaves that require grad (#113277 ) fixes https://github.com/pytorch/pytorch/issues/90552. This is a simpler fix that just detects the situation where AOTAutograd can't create a proper backward graph for the situation and graph breaks. This was technically a silent correctness issue before. This PR tries to always graph break when we see a factory function that returns a tensor requiring grad. I check this by seeing if the op returned a `TensorVariable` in dynamo, and if one of the input arguments was a `requires_grad=True` kwarg. I think this is high-fidelity enough, and I'm also hoping that this is uncommon enough that a graph break is reasonable here. The fix to avoid the graph break in user land is also pretty easy - just instantiate your tensor outside of the compiled region and plumb it in. Pull Request resolved: https://github.com/pytorch/pytorch/pull/113277 Approved by: https://github.com/eellison ghstack dependencies: #113267, #113416, #113584	2023-11-16 02:47:45 +00:00
Nikita Shulga	78f3937ee8	[BE] Handle errors in `set_num_threads` (#113684 ) and `set_num_interop_threads` Before that, calling `torch.set_num_threads(2**65)` resulted in a segmentation fault; afterwards it becomes a good old runtime error:

```
% python -c "import torch;torch.set_num_threads(2**65)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: Overflow when unpacking long
```

Similar to https://github.com/pytorch/pytorch/pull/60073 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113684 Approved by: https://github.com/Skylion007, https://github.com/albanD	2023-11-15 06:17:41 +00:00
Kurt Mohler	8bdce9bb74	Fix `UntypedStorage.resize_` to keep same CUDA device index (#113386 ) Fixes #113300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/113386 Approved by: https://github.com/albanD	2023-11-10 01:57:25 +00:00
Kurt Mohler	fd209543d5	Add `torch.utils.deterministic.fill_uninitialized_memory` flag (#111377 ) Part of #109802 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377 Approved by: https://github.com/albanD, https://github.com/aaronenyeshi	2023-11-01 16:10:09 +00:00
PyTorch MergeBot	ace2713d1e	Revert "Add `torch.utils.deterministic.fill_uninitialized_memory` flag (#111377 )" This reverts commit `f1785373c0`. Reverted https://github.com/pytorch/pytorch/pull/111377 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/111377#issuecomment-1784179040))	2023-10-29 17:41:55 +00:00
Nikita Shulga	b61efe1c2b	Fix `torch.[size|stride](dim=None)` invocation (#111991 ) Per documentation, one should be able to explicitly pass the dim argument as None to get tensor sizes/strides across all dimensions, but before this change it was incorrectly interpreted as a named tensor call. Modify `size` and `stride` signatures generated by `gen_pyi.py` to highlight that the overload with `None` returns a Tuple, while the one with `dim: _int` returns `int`. Add a regression test to validate the behavior, and remove the check for asserts from two named tensor tests (NamedTensors are dead, aren't they?) Fixes https://github.com/pytorch/pytorch/issues/111944 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111991 Approved by: https://github.com/zou3519	2023-10-26 04:14:35 +00:00
Kurt Mohler	f1785373c0	Add `torch.utils.deterministic.fill_uninitialized_memory` flag (#111377 ) Part of #109802 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111377 Approved by: https://github.com/albanD	2023-10-26 02:39:06 +00:00
Nikita Shulga	7709382b50	Fix regression in `torch.equal` behavior for NaNs (#111699 ) `torch.equal(x, x)` should return false if `x` is a tensor of floats, one of which is NaN. This renders some of the optimizations proposed in https://github.com/pytorch/pytorch/pull/100024 invalid, though as a result `torch.equal` will become much slower for identical floating point tensors. Add a regression test that calls torch.equal for a tensor containing NaN. Fixes https://github.com/pytorch/pytorch/issues/111251 Pull Request resolved: https://github.com/pytorch/pytorch/pull/111699 Approved by: https://github.com/Skylion007, https://github.com/albanD	2023-10-21 00:02:45 +00:00
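Why an identity shortcut breaks NaN semantics can be shown without PyTorch. The sketch below uses plain Python lists of floats as a stand-in for tensors. Both function names are hypothetical, and the shortcut shown is only an assumption about the kind of optimization the message refers to.

```python
import math


def equal_with_identity_shortcut(a, b):
    # Fast path: the same object is assumed to be equal to itself.
    if a is b:
        return True
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


def equal_elementwise(a, b):
    # No shortcut: every element is compared, so NaN != NaN is respected.
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


x = [1.0, math.nan]
shortcut_result = equal_with_identity_shortcut(x, x)  # True, NaN goes unnoticed
elementwise_result = equal_elementwise(x, x)          # False, since NaN != NaN
```

The shortcut saves a full comparison for identical inputs, which is the speed the commit gives up for floating-point tensors in exchange for correct NaN handling.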
CaoE	d1afb7d43d	add Half support for multinomial on CPU (#104178 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/104178 Approved by: https://github.com/jgong5, https://github.com/kulinseth, https://github.com/cpuhrsch	2023-10-20 19:16:04 +00:00
Evgeni Burovski	48989bc820	trace frames with np.ndarray (#110512 ) Fixes #109604 Resubmit gh-109715 + several skips and small fixes to make tests pass. The main fix here is by @ysiraichi : previously, dynamo did not resume tracing numpy ndarrays after a graph break. While at it, fix several small issues Yukio's fix uncovers: - graph break gracefully on numpy dtypes which do not map to torch.dtypes (uint16 etc) - recognize array scalars in dynamo, treat them as 0D ndarrays - make sure that iterating over torch.ndarray generates arrays not bare tensors Pull Request resolved: https://github.com/pytorch/pytorch/pull/110512 Approved by: https://github.com/lezcano	2023-10-15 00:56:10 +00:00
CaoE	8713a1a363	add Half support for bernoulli on CPU (#104176 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/104176 Approved by: https://github.com/mingfeima, https://github.com/cpuhrsch	2023-10-13 01:18:55 +00:00
Prachi Gupta	53a9ac534c	Added decorator `skipRocmIfTorchInductor` and skipped failing tests (#107760 ) This PR adds a skip decorator which will disable tests in CI for ROCm inductor workflow. This new workflow will be coming in via https://github.com/pytorch/pytorch/pull/110544 Pull Request resolved: https://github.com/pytorch/pytorch/pull/107760 Approved by: https://github.com/jataylo, https://github.com/pruthvistony, https://github.com/atalman	2023-10-12 16:00:35 +00:00
Elias Ellison	cf1da9bd17	enable index add test (#111016 ) Dynamo is swallowing a user exception when suppress_errors is set to True. There's an issue filed for that: https://github.com/pytorch/pytorch/issues/108798. In the meantime, we would still like the functionality in this test, which works without the default setting (don't suppress errors), to not regress. Pull Request resolved: https://github.com/pytorch/pytorch/pull/111016 Approved by: https://github.com/yanboliang	2023-10-11 19:41:35 +00:00
eellison	fb4b9e9c8e	Re-enable a couple of fixed tests (#110770 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110770 Approved by: https://github.com/yanboliang, https://github.com/int3, https://github.com/Skylion007 ghstack dependencies: #110651	2023-10-10 19:13:14 +00:00
eellison	c5f06b9753	Re-enable test_copy_transpose_math_view, neg_view/dce fix (#110651 ) - neg view can just be lowered to neg() post functionalization - we were treating all fallback kernels as not having side effects. we shouldn't dce mutating fallback kernels - either mutations induced by the reinplacing pass or clone_ with unsupported arguments (complex) Pull Request resolved: https://github.com/pytorch/pytorch/pull/110651 Approved by: https://github.com/Chillee, https://github.com/jansel, https://github.com/malfet, https://github.com/Skylion007	2023-10-10 16:34:01 +00:00
jjsjann123	37567fdf31	Nvfuser cpp api deprecation attempt 2 (#110881 ) attempting to re-try #110318 deprecating nvfuser c++ API warning has been updated to TORCH_WARN_ONCE; Warning thrown inside torch::jit::fuser::cuda::isEnabled() is turned off and will be deprecated when we pulled out TorchScript integration in the follow up PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110881 Approved by: https://github.com/davidberard98, https://github.com/NicolasHug	2023-10-10 08:07:03 +00:00
PyTorch MergeBot	bbdc8c7b05	Revert "deprecating nvfuser c++ API (#110318 )" This reverts commit `bf0866fc16`. Reverted https://github.com/pytorch/pytorch/pull/110318 on behalf of https://github.com/davidberard98 due to too many warnings being thrown in torchvision https://github.com/pytorch/pytorch/issues/110857 ([comment](https://github.com/pytorch/pytorch/pull/110318#issuecomment-1753245449))	2023-10-09 15:41:50 +00:00
jjsjann123	bf0866fc16	deprecating nvfuser c++ API (#110318 ) deprecating nvfuser c++ API Pull Request resolved: https://github.com/pytorch/pytorch/pull/110318 Approved by: https://github.com/davidberard98	2023-10-07 02:25:21 +00:00
eellison	3812f2e40c	Preserve layout on like constructors (#110242 ) Partially fixes `test_memory_format_factory_like_functions_preserve` with PYTORCH_TEST_WITH_INDUCTOR. Inductor preserves memory layouts for user-visible outputs as annotated on the fx graph that it is passed in. That graph is generated from running aot_autograd with decompositions. If the decompositions give incorrect strides, so will inductor. This preserves the layout of `_like` operators when it corresponds to a `torch.memory_format`. It doesn't fix a) arbitrary permutations, b) striding of non-dense outputs. Both of these are lower-pri compared to preserving channels last. We would need either https://github.com/pytorch/pytorch/issues/92920 or a `to` variant that takes in a physical layout arbitrary permutations. I converted the output of rand to the correct layout instead of passing the layout in so that this would compose with the `replace_random` pass, and because the two pointwise ops will get fused anyway. Pull Request resolved: https://github.com/pytorch/pytorch/pull/110242 Approved by: https://github.com/int3	2023-10-02 23:53:55 +00:00
Moritz Hennen	09c598745c	Rename `torch._C._TensorBase` to `TensorBase` (#109940 ) I have gone ahead and implemented the renaming of the type `torch._C._TensorBase` to a non-private class name `TensorBase`. The changes also include leaving `torch._C._TensorBase` as an alias to the new type: `70458768fb/torch/csrc/autograd/python_variable.cpp (L2196-L2197)` both in the c++ code and in the corresponding `__init__.pyi.in` file: `70458768fb/torch/_C/__init__.pyi.in (L1522)` Fixes #109438 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109940 Approved by: https://github.com/ezyang	2023-09-25 19:10:22 +00:00
Jez Ng	063a62622b	Add memory overlap check to `meta_copy_` (#108989 ) Fixes `test_copy_many_to_one`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/108989 Approved by: https://github.com/eellison	2023-09-12 23:28:14 +00:00
Kurt Mohler	4c5e43574c	Reland 2: Add PyObject preservation for UntypedStorage (#109039 ) Relands #103907 after it was reverted. This PR makes the new `ignore_hermetic_tls` argument of `check_pyobj` optional to avoid causing a compilation error in torchdistx Part of #91395 Pull Request resolved: https://github.com/pytorch/pytorch/pull/109039 Approved by: https://github.com/ezyang	2023-09-12 22:26:05 +00:00
PyTorch MergeBot	41bd0fde7e	Revert "Remove fixed skips (#108674 )" This reverts commit `ab9fb03d6f`. Reverted https://github.com/pytorch/pytorch/pull/108674 on behalf of https://github.com/huydhn due to Sorry for picking this up a bit late, but with https://github.com/pytorch/pytorch/pull/108647 reverted, these tests are failing again. So we need to wait for the PR to reland before we can land this change ([comment](https://github.com/pytorch/pytorch/pull/108674#issuecomment-1715202692))	2023-09-12 08:04:32 +00:00
PyTorch MergeBot	59f605be57	Revert "Reland 2: Add PyObject preservation for UntypedStorage (#109039 )" This reverts commit `419e4e17a2`. Reverted https://github.com/pytorch/pytorch/pull/109039 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing linter job in trunk, probably due to a landrace ([comment](https://github.com/pytorch/pytorch/pull/109039#issuecomment-1715147020))	2023-09-12 07:26:11 +00:00

2147 Commits