pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 00:21:07 +01:00

Author	SHA1	Message	Date
Riley Dulin	3be150653c	[torch][ao] Add customizable loss function to NodeAccuracySummary (#136282 ) Summary: Add a customizable loss function callback to NodeAccuracySummary to allow users to pass in their own loss function. Also, fix some type errors and propagate better exception messages when unexpected tensor comparisons occur. Finally, enhance the robustness of `generate_numeric_debug_handle` in the case where it is called multiple times on the same model, by avoiding reuse of the same IDs. Test Plan: Added a test for this case in `test_numeric_debugger`. Reviewed By: jerryzh168 Differential Revision: D62898297 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282 Approved by: https://github.com/jerryzh168	2024-09-24 03:28:12 +00:00
Guilherme Leobas	e09c5b6046	Remove `vt` argument in `raise_observed_exception` (#136037 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136037 Approved by: https://github.com/zou3519	2024-09-24 02:36:57 +00:00
fduwjj	9372692c7b	[FR] Make OSS fr_trace function available for internal script and improve pg filtering (#136473 ) Differential Revision: [D63287384](https://our.internmc.facebook.com/intern/diff/D63287384/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136473 Approved by: https://github.com/c-p-i-o	2024-09-24 02:34:43 +00:00
Nikita Shulga	4fd16dd8aa	Clarify that `libtorch` API is C++17 compatible (#136471 ) As it relies on some common C++17 primitives, such as `std::optional` Replace all docs references from C++14 to C++17 Fixes https://github.com/pytorch/pytorch/issues/133205 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136471 Approved by: https://github.com/kit1980, https://github.com/atalman	2024-09-24 02:03:33 +00:00
Jez Ng	e4d294221b	[inductor] Log precompilation time (#136395 ) This has been useful for diagnosing the long compile time issues I've seen in the Triton CPU backend. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136395 Approved by: https://github.com/eellison	2024-09-24 01:47:54 +00:00
Edward Z. Yang	802ba79121	Inherit all secrets to inductor workflow (#135354 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135354 Approved by: https://github.com/desertfire, https://github.com/atalman, https://github.com/malfet	2024-09-24 01:30:40 +00:00
Aaron Orenstein	06909803cc	Existing mypy issues (#136236 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136236 Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007	2024-09-24 01:02:07 +00:00
Xuan Zhang	a14f57b126	fix the inductor tests (#136474 ) Fixes https://github.com/pytorch/pytorch/issues/136464 introduced in https://github.com/pytorch/pytorch/pull/134874 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136474 Approved by: https://github.com/malfet	2024-09-24 00:59:22 +00:00
Nikita Shulga	9d9bc65b5e	Make `FlashAttentionKernel.cpp` compilable for SVE with GCC-11 (#136477 ) Extends https://github.com/pytorch/pytorch/pull/132434 to all minor revisions of GCC-11, as they all likely affected by https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95528 Hattip to @abhishek-iitmadras for the investigation Fixes https://github.com/pytorch/pytorch/issues/136432 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136477 Approved by: https://github.com/atalman, https://github.com/kit1980	2024-09-24 00:54:26 +00:00
Ke Wen	e0f84f40f7	[Pipelining] Allow non-0 stages to accept kwargs (#136416 ) For supporting usage case in torchchat: all non-0 stages requires `input_pos` and `cache_lane`. ``` kwargs = {"input_pos": input_pos, "cache_lane": lane} if pp_rank == first_pp_rank: output = decorder.step(new_token, kwargs) elif pp_rank == last_pp_rank: output = decorder.step(kwargs) else: # middle pp ranks decorder.step(**kwargs) ``` The `forward_one_chunk` code today hard sets `{}` as kwarg for non-0 stages, hence cannot support the above use case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136416 Approved by: https://github.com/wconstab	2024-09-23 23:50:59 +00:00
Guilherme Leobas	52c917b0ba	Optimize dict reconstruct to not codegen untouched values (#134876 ) PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follow: (1) codegen(...) each pair of key/value (2) create a new dictionary to hold the new items (3) clear the original dictionary (4) update the original dict with the one created in (2) We do a micro optimization in the generated bytecode to: - Only codegen the items that changed. - Only clear the original dictionary if a key was removed. Fixes: #133487 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876 Approved by: https://github.com/zou3519	2024-09-23 21:45:44 +00:00
fduwjj	5033a1ca0d	[RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957 ) 1. We want to take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712, so every time when we retry, we create a new TCPStore server first so that we don't need to append attempt count as prefix and avoid eventually TCPStore sync failure. (This is only for the TCPStore sharing enabled case) 2. We start a new server bound to an ephemeral port (i.e. 0) so it gets assigned to a free port. We then pass that downstream (trainer or c10d). By doing so, TCPStore is managed by the elastic agent rather than having a race condition on binding to a specific port in the trainer. 3. Then the port be broadcasted for dynamic_rendezvous. Only one more question, what do we do about the store created from (_create_tcp_store) torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py, are we ok with creating a duplicate TCPStore server? Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957 Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o	2024-09-23 20:32:24 +00:00
PyTorch MergeBot	fd182b90a7	Revert "Add deterministic path for CUDA `cumsum` (#136224 )" This reverts commit `d45b0151e5`. Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Failing internall CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2369244135))	2024-09-23 19:57:13 +00:00
Nikita Shulga	08dba25775	[BE] Do not use deprecated APIs in SparseCsrTensorMath.cu (#136449 ) - `Tensor::type()` -> `Tensor::scalar_type()` - `Tensor::data<T>()` -> `Tensor::data_ptr<T>()` Should fix following warnings during the compilation: ``` caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/cutlassB_f32_notaligned_k128_dropout.cu.o[0m /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In function ‘void at::native::_GLOBAL__N__496f0b0c_22_SparseCsrTensorMath_cu_868dd545::_apply_sparse_csr_linear_solve(const at::Tensor&, const at::Tensor&, bool, const at::Tensor&)’: /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:739:36: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 739 \| int* rowOffsets = crow.data<int>(); \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:740:35: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 740 \| int* colIndices = col.data<int>(); \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function: /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:44: error: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:225:1: note: declared here 225 \| DeprecatedTypeProperties & type() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here 109 \| inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) { \| ^~~~~~~~~~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here 109 \| inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) { \| ^~~~~~~~~~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function: /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1014: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1054: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1094: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| ^ /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function: /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations] 753 \| AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] { \| /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here 247 \| T * data() const { \| ^ ~~ ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136449 Approved by: https://github.com/huydhn	2024-09-23 19:20:34 +00:00
Xiaodong Wang	9a1dc41de7	[AMD] Skipping 0 byte send/recv for AMD GPU (#136362 ) Summary: We found jobs getting stuck by send/recv zero bytes with RDMA on AMD GPUs. So just skipping them. Reviewed By: danzimm Differential Revision: D63075000 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136362 Approved by: https://github.com/malfet, https://github.com/houseroad	2024-09-23 19:14:12 +00:00
Edward Z. Yang	3116fbda0f	Increase update_hint_regression problem size to 1000 (#136434 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136434 Approved by: https://github.com/laithsakka	2024-09-23 18:51:44 +00:00
PyTorch MergeBot	274883083d	Revert "[AOTI] Create another wrapper class to handle ArrayRef (#136318 )" This reverts commit `d21841d077`. Reverted https://github.com/pytorch/pytorch/pull/136318 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136318#issuecomment-2368957264))	2024-09-23 17:47:49 +00:00
Aleksei Nikiforov	d859fcbc61	s390x: build s390x binaries on each pull request (#125399 ) Ensure that s390x keeps building for each PR Pull Request resolved: https://github.com/pytorch/pytorch/pull/125399 Approved by: https://github.com/huydhn	2024-09-23 17:39:48 +00:00
Joel Schlosser	83a3ee0699	Support embedding_bag() with NJT input (#135888 ) Fixes #93843 `EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the python-side to support NJT and adds corresponding OpInfo-based NJT tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888 Approved by: https://github.com/cpuhrsch	2024-09-23 17:35:19 +00:00
James Wu	4649aeaebf	Make AOTAutogradCache support remote FXGraphCache (#136173 ) Summary: After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache. This does not implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache. Test Plan: (Meta only tests) Reviewed By: aorenste Differential Revision: D62384944 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173 Approved by: https://github.com/oulgen	2024-09-23 17:24:27 +00:00
Nikita Shulga	c3e678382b	Fix addmm silent correctness on aarch64 (#136371 ) Do not dispatch to fast gemmv functions when alpha is not equal to 1 Add regression test to address the problem Fixes https://github.com/pytorch/pytorch/issues/136299 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136371 Approved by: https://github.com/swolchok	2024-09-23 17:10:34 +00:00
Edward Z. Yang	f0f79dd8f1	Correctly convert Python float to float64 when passing argument as Tensor (#136413 ) I can't actually test the Dynamo codegen fix as it is impossible to directly use the Tensor at the moment. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/136413 Approved by: https://github.com/bobrenjc93	2024-09-23 16:48:08 +00:00
wz337	637d5c4b7e	[DSD] Fix loading uneven full tensor into sharded state dict (#136365 ) Fix #136228. This is a follow up on https://github.com/pytorch/pytorch/pull/135725. We need to pass shape and stride from the original dtensor, since for uneven case, `from_local` would calculate shape and stride assuming the tensor is evenly-sharded based on the local tensor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136365 Approved by: https://github.com/fegin	2024-09-23 16:35:58 +00:00
fduwjj	da51fe1c42	[FR] Fix errors in all2all check, improve some log output (#136399 ) We found that we show the hashed pg name in our script output, which is not UX friendly. Also we found a bug in our all2all check and we made a bunch of changes to error messages to make it better readable. Differential Revision: [D63206469](https://our.internmc.facebook.com/intern/diff/D63206469) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136399 Approved by: https://github.com/c-p-i-o	2024-09-23 16:31:31 +00:00
PyTorch MergeBot	df6a8fa1eb	Revert "[aotd] Fix freezing API for subclasses (#136265 )" This reverts commit `cdef760560`. Reverted https://github.com/pytorch/pytorch/pull/136265 on behalf of https://github.com/atalman due to Breaks internal CI sorry, need to revert ([comment](https://github.com/pytorch/pytorch/pull/136265#issuecomment-2368772574))	2024-09-23 16:25:05 +00:00
Andrew Gu	9992084f38	[FSDP2] Fixed `test_all_gather_extensions_monkey_patch` (#136130 ) I messed up the test before. The extensions were not running :/ Pull Request resolved: https://github.com/pytorch/pytorch/pull/136130 Approved by: https://github.com/weifengpy ghstack dependencies: #136129	2024-09-23 15:12:44 +00:00
Andrew Gu	b9f53c0dce	[FSDP2] Added module, mp policy to `fsdp_pre_all_gather` (#136129 ) - Sometimes having access to the `MixedPrecisionPolicy` in the `fsdp_pre_all_gather` is useful. See [here](https://github.com/pytorch/ao/pull/748/files#r1760375325) in the torchao INT8 mixed precision training PR. - Sometimes having access to the owning `nn.Module` allows for using it for saving state. See [here](https://github.com/pytorch/pytorch/issues/114299#issuecomment-2298692762) for an example. The major paint point here is how to deal with backward compatibility. For now, we use `signature.inspect` to check if the user subclass follows the old vs. new signature. However, for the new signature, the `param_dtype` in the post-all-gather is redundant, as if the user needed it, the user could save it from the `mp_policy` passed in the pre-all-gather now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136129 Approved by: https://github.com/weifengpy	2024-09-23 15:12:36 +00:00
Bin Bao	d21841d077	[AOTI] Create another wrapper class to handle ArrayRef (#136318 ) Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code. Test Plan: CI Differential Revision: D62961885 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136318 Approved by: https://github.com/frank-wei	2024-09-23 15:10:27 +00:00
PyTorch MergeBot	0e19522122	Revert "Adds support for accelerated sorting with x86-simd-sort (#127936 )" This reverts commit `239a9ad65e`. Reverted https://github.com/pytorch/pytorch/pull/127936 on behalf of https://github.com/atalman due to test/test_sort_and_select.py::TestSortAndSelectCPU::test_sort_discontiguous_slow_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10994904767/job/30525578456) [HUD commit link](`239a9ad65e`) ([comment](https://github.com/pytorch/pytorch/pull/127936#issuecomment-2368522316))	2024-09-23 14:52:23 +00:00
Edward Z. Yang	bae427e4b1	Refactor maybe_evaluate_static into a worker function off of ShapeEnv (#135107 ) By refactoring this way, I can put a non-expiring LRU cache here. Splitting also will make it easier for me to tell who is using up all the time. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/135107 Approved by: https://github.com/aorenste	2024-09-23 14:39:20 +00:00
PyTorch MergeBot	e9bfbf78d5	Revert "Allow fx graph caching higher order operators (opt-in) (#135877 )" This reverts commit `66d5eb64e0`. Reverted https://github.com/pytorch/pytorch/pull/135877 on behalf of https://github.com/jeanschmidt due to seems to have introduced regressions on rocm signals ([comment](https://github.com/pytorch/pytorch/pull/135877#issuecomment-2367616653))	2024-09-23 09:04:24 +00:00
cyy	75f141be62	Avoid unnecessary CMake warnings on Windows (#136393 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/136393 Approved by: https://github.com/ezyang	2024-09-23 06:42:59 +00:00
Yuxin Wu	663e760065	add unittest for OOM message (#129671 ) Add unittest for the bug in #123984 Pull Request resolved: https://github.com/pytorch/pytorch/pull/129671 Approved by: https://github.com/eqy	2024-09-23 04:48:01 +00:00
Yiming Zhou	068fdd602f	[export] enable custom tag metadata re-export test (#136048 ) Improves and enables a commented out test originally introduced in #131912 In `test_custom_tag_metadata_re_export()`, we check the added "custom" metadata to given nodes is preserved and not copied to other nodes after re-exporting Pull Request resolved: https://github.com/pytorch/pytorch/pull/136048 Approved by: https://github.com/zhxchen17	2024-09-23 04:37:58 +00:00
Oguz Ulgen	66d5eb64e0	Allow fx graph caching higher order operators (opt-in) (#135877 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877 Approved by: https://github.com/zou3519	2024-09-23 04:33:27 +00:00
cyy	a38e4c5e1e	Enable clang-tidy warnings on aten/src/ATen/cuda/*.cpp (#134547 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134547 Approved by: https://github.com/ezyang	2024-09-23 03:44:55 +00:00
Isuru Fernando	f276da7f98	Remove prims.slice_in_dim and prims.slice (#136150 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136150 Approved by: https://github.com/ezyang	2024-09-23 01:27:22 +00:00
Xilun Wu	3406ac24d9	[BE] fix circular import in torch/distributed/utils.py (#136286 ) Summary Fix circular import in `torch/distributed/utils.py` found when running internal test, see D62901023. Curious why this wasn't causing any issue. Is this relevant code deprecated and no longer used? Pull Request resolved: https://github.com/pytorch/pytorch/pull/136286 Approved by: https://github.com/Skylion007	2024-09-22 20:54:12 +00:00
Shangdi Yu	3bc073d728	[aoti] Fix workspace generation for triton (#135552 ) Fixes #131337 - add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`. - do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead. - add workspace allocation generation code to `kernel_autotune_calls`. e.g. ```python workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8) workspace.zero_() ..... triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0) del buf2, arg0_1, arg1_1, workspace ``` - add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code. The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `. ```cpp static constexpr int64_t int_array_0[] = {1280L, }; static constexpr int64_t int_array_1[] = {1L, }; AtenTensorHandle workspace_handle; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle)); RAIIAtenTensorHandle workspace(workspace_handle); workspace.zero_(); ``` - Fix handle grid_fn for grid computation. Pass in "RBLOCK" to `split_scan_grid` - Fix dynamic shapes: Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined. The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code. - We also generate slightly different cpp code depending on if `abi_compatible` is turned on. ```cpp RAIIAtenTensorHandle workspace(workspace_handle); AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get())); ``` vs ```cpp at::Tensor workspace = at::detail::empty_strided_cuda({8L(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA); workspace.zero_(); ``` Test Plan: ``` TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552 Approved by: https://github.com/desertfire	2024-09-22 04:51:37 +00:00
Zhou, Lingzhi	35532fc477	[Partitioner] Reuse partition to check whether nodes exist (#135317 ) The time complexity of find node whether in NodeList is O(n). Reuse partition to speed up due to partition.nodes is hash table and has same elements. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317 Approved by: https://github.com/ezyang	2024-09-21 23:52:02 +00:00
cyy	e4cdc31227	[14/N] Fix clang-tidy warnings in aten/src/ATen (#133988 ) Follows #133807 Pull Request resolved: https://github.com/pytorch/pytorch/pull/133988 Approved by: https://github.com/ezyang	2024-09-21 22:41:40 +00:00
Bob Ren	9731ccb9e0	Type _dynamo/variables/lazy.py (#136376 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136376 Approved by: https://github.com/Skylion007	2024-09-21 22:18:02 +00:00
Jovian Anthony Jaison	09715638ab	Add _dynamo.config.suppress_errors logging (#136379 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136379 Approved by: https://github.com/ezyang	2024-09-21 21:00:26 +00:00
Aaron Orenstein	3176966732	update cache tests (#136215 ) Summary: - Clean up cache test code a bit. - Removed patch_fbcode() - it turned out to cause flaky issues (image if it set fbcode=False and then loaded a module for the first time which had a top-level fbcode check). Test Plan: unit tests Reviewed By: oulgen Differential Revision: D62648248 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136215 Approved by: https://github.com/bobrenjc93	2024-09-21 20:36:22 +00:00
Ramana Sundararaman	be4b7e8131	Param fixes in docstring (#136097 ) Fixes wrong param names in docstrings. cc: @kit1980 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136097 Approved by: https://github.com/ezyang	2024-09-21 18:56:34 +00:00
Aaron Gokaslan	b6ffa381e1	[BE]: Add half CUDA support nextafter (#136373 ) Making CUDA support match CPU support for nextafter Pull Request resolved: https://github.com/pytorch/pytorch/pull/136373 Approved by: https://github.com/ezyang	2024-09-21 17:13:45 +00:00
PyTorch MergeBot	cc17d58809	Revert "S390x update builder image (#132983 )" This reverts commit `080a249fc2`. Reverted https://github.com/pytorch/pytorch/pull/132983 on behalf of https://github.com/atalman due to Authenticate With PUSH is failing. Error: no registries found in registries.conf, a registry must be provided. Error: Process completed with exit code 125. ([comment](https://github.com/pytorch/pytorch/pull/132983#issuecomment-2365249249))	2024-09-21 16:46:54 +00:00
Xuan Zhang	03957efa5d	[inductor][scheduler] reorder scheduler nodes after fusion to reduce peak memory (#134874 ) Motivations: A topological order of the scheduler nodes that optimize the liveness of buffers can reduce the peak memory utilization. This has been observed and studied e.g., [here](https://arxiv.org/pdf/1910.02653) and [here](https://proceedings.mlr.press/v202/steiner23a/steiner23a.pdf). Solutions: 1. implement a peak memory estimator via liveness analysis 2. implement a few memory aware topological sorting algorithms and pick the one with the lowest peak memory Results: On some models we can reduce the peak memory significantly: \| model \| batch size \| peak_memory baseline \| peak_memory new \| ratio \| \|:-----------------------------:\|:----------:\|:--------------------:\|:---------------:\|:-----:\| \| alexnet \| 128 \| 1.17 \| 0.99 \| 1.19 \| \| vgg16 \| 64 \| 4.10 \| 3.57 \| 1.15 \| \| DebertaV2ForQuestionAnswering \| 1 \| 11.60 \| 10.56 \| 1.10 \| In the presence of compiler based AC, peak memory can be further reduced: \| model \| batch size \| peak_memory baseline \| peak_memory new \| ratio \| \|:------------------------------:\|:----------:\|:--------------------:\|:---------------:\|:-----:\| \| AlbertForMaskedLM \| 4 \| 6.87 \| 6.43 \| 1.07 \| \| AlbertForQuestionAnswering \| 4 \| 8.69 \| 7.76 \| 1.12 \| \| MobileBertForQuestionAnswering \| 128 \| 4.67 \| 3.90 \| 1.20 \| [Here](https://fb.workplace.com/groups/1075192433118967/posts/1499920537312819/?comment_id=1499938843977655&reply_comment_id=1499951630643043) is an internal use case. Other infos: * neutral model runtime, because the the reordering happens after fusion. So memory saving is _for free_. * minimal compile time overhead as the algorithm is linear in the number of edges of the inductor graph. For all hugglingface benchmark models, the additional compile time is less than 1 second. * no peak memory regression since we only adopt a new order if the peak memory is reduced based on the estimator. However, the model is unaware of operators' working memories, but for large models, the working memory should be negligible. We haven't observed any significant regressions on all of our tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134874 Approved by: https://github.com/yf225	2024-09-21 16:28:38 +00:00
DavidGu-Datong	fb4670a1f9	fix mean_out: op does not update parameter out for BF16/FP16 dtype on CPU (#135174 ) Fixes #134848 For BF16/FP16, when a tensor is specified in `out` parameter of mean, the mean kernel should use its storage for output, but that doesn't happen, since an `at::to` in the current code causes storage to be allocated again, but the `out` parameter tensor's storage doesn't get updated, resulting in it not holding the mean output. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135174 Approved by: https://github.com/soulitzer	2024-09-21 14:21:42 +00:00
Will Constable	ea737e4e5d	[Pipelining] Make PipelineStage support meta initialization (#136243 ) Avoid allocating memory or dry-running the submodule during stage init. Save user-provided input/output metadata during stage init, to allow lazily initializing the buffers before the first step call. Later, we plan to build on top of this to add lazy shape inference (#130856) so that no input/output shapes are required at stage init. For now, we require input/output tensors for stage init, but these should be on meta device and stage should not allocate any real memory. Note: this needs more thorough testing and review, but it worked on the torchtitan 3d test. TODO: - delete 'device' arg from PipelineStage ctor? (move it to inferred from args tensors passed to first step call? separate PR. - delete 'output_args' from PipelineStage ctor? we don't actually need it, but we use it to do shape validation, which is why I didn't remove it in this PR. Proposal: leave it until we add lazy shape inference? Fixes #136225, #136226 Pull Request resolved: https://github.com/pytorch/pytorch/pull/136243 Approved by: https://github.com/H-Huang, https://github.com/kwen2501	2024-09-21 09:47:22 +00:00

1 2 3 4 5 ...

78699 Commits