pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Aaron Orenstein	8c356ce3da	Fix lint errors in fbcode (#135614 ) Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps. After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports. Test Plan: ``` fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS ``` Before: ``` ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$ ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$ ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib. Some things to try: ``` Differential Revision: D62049222 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614 Approved by: https://github.com/oulgen, https://github.com/laithsakka	2024-09-13 02:04:34 +00:00
Jason Ansel	bf68e16e94	[dynamo] Fix support for classmethod(property(...)) (#134968 ) Fixes #134451 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968 Approved by: https://github.com/yanboliang	2024-09-13 01:14:18 +00:00
eqy	d732df7e56	[Inductor] Disable TF32 in `test_slice_scatter_reinplace` (#135709 ) TF32 linear/matmul numerics seem unrelated to test functionality so disabling it here to abate noisy failures Pull Request resolved: https://github.com/pytorch/pytorch/pull/135709 Approved by: https://github.com/eellison	2024-09-13 00:30:45 +00:00
Sahan Paliskara	c9de2efde6	[Docs] fix inconsistent docs in conv1d, conv2d, and conv3d (#135894 ) Addresses https://github.com/pytorch/pytorch/issues/135880 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135894 Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet	2024-09-13 00:19:42 +00:00
Jason Ansel	1f15c0c7a5	[fx] Replace _snake_case with a regexp (#135822 ) ~2x speedup on this function, though saves <0.5s overall Pull Request resolved: https://github.com/pytorch/pytorch/pull/135822 Approved by: https://github.com/oulgen ghstack dependencies: #135787, #135788, #135820, #135821	2024-09-13 00:18:41 +00:00
Jason Ansel	a72124add9	[fx] Minor optimization in create_arg (#135821 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135821 Approved by: https://github.com/oulgen ghstack dependencies: #135787, #135788, #135820	2024-09-13 00:18:41 +00:00
Jason Ansel	10ca4c0564	[inductor] Use TracerBase directly in LoopBody (#135820 ) This skips some unneeded work in the subclass. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135820 Approved by: https://github.com/oulgen ghstack dependencies: #135787, #135788	2024-09-13 00:18:41 +00:00
Jason Ansel	d3aab9642b	[inductor] Optimize can_fuse_vertical() (#135788 ) An O(n^2) to O(n) improvement by not comparing all pairs of deps. Before: ![image](https://github.com/user-attachments/assets/797cd1bd-5d53-4374-8e76-ffce4232d7f9) After: ![image](https://github.com/user-attachments/assets/1e61bf29-adba-41a4-839e-f028130fa979) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135788 Approved by: https://github.com/oulgen ghstack dependencies: #135787	2024-09-13 00:18:41 +00:00
Jason Ansel	67a929eea8	[inductor] Remove unused check (#135787 ) I think this is unreachable code because mode is always None on reads. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135787 Approved by: https://github.com/oulgen	2024-09-13 00:18:41 +00:00
Isuru Fernando	f576960bbc	do not expand in replace/simplify if no changes (#135863 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135863 Approved by: https://github.com/ezyang	2024-09-13 00:12:01 +00:00
Nikita Shulga	1aba224cfd	Update nightly PyTorch version to 2.6.0 (#135916 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/135916 Approved by: https://github.com/kit1980	2024-09-13 00:08:52 +00:00
Shangdi Yu	d383325392	[aoti] Fix workspace generation for triton (#135552 ) Fixes #131337 - add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`. - do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead. - add workspace allocation generation code to `kernel_autotune_calls`. e.g. ```python workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8) workspace.zero_() ..... triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0) del buf2, arg0_1, arg1_1, workspace ``` - add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code. The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `. ```cpp static constexpr int64_t int_array_0[] = {1280L, }; static constexpr int64_t int_array_1[] = {1L, }; AtenTensorHandle workspace_handle; AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle)); RAIIAtenTensorHandle workspace(workspace_handle); workspace.zero_(); ``` - Fix handle grid_fn for grid computation. Pass in "RBLOCK" to `split_scan_grid` - Fix dynamic shapes: Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined. The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code. - We also generate slightly different cpp code depending on if `abi_compatible` is turned on. ```cpp RAIIAtenTensorHandle workspace(workspace_handle); AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get())); ``` vs ```cpp at::Tensor workspace = at::detail::empty_strided_cuda({8L(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA); workspace.zero_(); ``` Test Plan: ``` TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552 Approved by: https://github.com/desertfire	2024-09-12 23:53:09 +00:00
Ma Jian	00dc7d4356	fix compiled_autograd deadlock throw (#135795 ) Fixes #135298 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795 Approved by: https://github.com/xmfan	2024-09-12 23:24:57 +00:00
Yanbo Liang	1760bbc259	[FlexAttention] Ensure q/k/v and block_mask on excact the same device (#135823 ) Fixes #134739 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135823 Approved by: https://github.com/BoyuanFeng	2024-09-12 23:11:01 +00:00
Jack Taylor	fb9d8e3248	[ROCm] Use ieee precision for fp32 in flex attention (#135702 ) `3bebc09be9` Brought in a change to flex_attention to allow TF32 precision, this largely lacks support on ROCm side and we should use ieee. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135702 Approved by: https://github.com/jeffdaily, https://github.com/drisspg	2024-09-12 23:00:48 +00:00
eellison	aaabfc8930	[Easy] Check if quant registered in constant folding (#135875 ) Belated fix for https://github.com/pytorch/pytorch/issues/110904 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135875 Approved by: https://github.com/shunting314	2024-09-12 22:16:39 +00:00
William Wen	63d6cd351a	[dynamo] support torch.nn.attention.sdpa_kernel context manager (#135404 ) Fixes https://github.com/pytorch/pytorch/issues/134608 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135404 Approved by: https://github.com/jansel, https://github.com/drisspg	2024-09-12 22:04:48 +00:00
PyTorch MergeBot	3de9e474df	Revert "Check function declarations of Core ML code (#135467 )" This reverts commit `bc1b8f094d`. Reverted https://github.com/pytorch/pytorch/pull/135467 on behalf of https://github.com/malfet due to This breaks ios periodic jobs, see https://github.com/pytorch/pytorch/actions/runs/10797026668/job/29947377532 ([comment](https://github.com/pytorch/pytorch/pull/135467#issuecomment-2347322784))	2024-09-12 22:04:35 +00:00
PyTorch MergeBot	3e1a4ea132	Revert "[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725 )" This reverts commit `83c594ebd6`. Reverted https://github.com/pytorch/pytorch/pull/135725 on behalf of https://github.com/ZainRizvi due to This is breaking lint. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10835983999/job/30068709508) [HUD commit link](`83c594ebd6`) ([comment](https://github.com/pytorch/pytorch/pull/135725#issuecomment-2347303272))	2024-09-12 21:47:38 +00:00
Sanskar Modi	e157ce3ebb	Validate input types for `torch.nn.Linear` and `torch.nn.Bilinear` (#135596 ) Adding validation checks to check the input types and display better error messages for the same. Fixes https://github.com/pytorch/pytorch/issues/135463 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135596 Approved by: https://github.com/malfet	2024-09-12 21:28:37 +00:00
Pian Pawakapan	b897ab0540	[export] ignore mark_dynamic() in export (#135536 ) Previously we were accomodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`. Note: there's 4 decorators relevant to export, `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using others. One reason I decided to not warn and just silently ignore is these decorators cause the tensors to carry dynamic info, and it'll be hard to tell whether the markers are from export or user calls when re-exporting with the same inputs. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536 Approved by: https://github.com/avikchaudhuri	2024-09-12 21:22:19 +00:00
Fadi Arafeh	3d24313809	Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058 ) Optimized dynamic quantization for aarch64 was enabled by #126687 and #134897 This PR fixes an issue for aarch64 where on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. if input dimensions change) [ideep::matmul_forward::compute ](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) (wrongly) runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174) which is not supported by oneDNN+ACL (Arm Compute Library), causing the workload to fall back to a much slower oneDNN gemm:jit kernel Example: ```python import torch DIM = 4096 INPUT_SIZE1 = 32 INPUT_SIZE2 = 16 class LinearNet(torch.nn.Module): def __init__(self): super().__init__() self.fc1 = torch.nn.Linear(DIM, DIM, bias=False) def forward(self, x): x = self.fc1(x) return x input1 = torch.randn(size=(INPUT_SIZE1, DIM)) input2 = torch.randn(size=(INPUT_SIZE2, DIM)) with torch.no_grad(): model = LinearNet() model = torch.ao.quantization.quantize_dynamic(model,{torch.nn.Linear}) model(input1) # this goes to ACL lowp_gemm print("="50) model(input2) # this goes to gemm:jit without this PR, and to ACL with this PR ``` In the code snippet above: - The matmul from `model(input1)` goes to oneDNN+ACL (in both cases, with and without the PR) - The matmul from `model(input2)`: Without this PR: there's a cache miss (different input shapes) and matmul_forward::compute is run with the default lowp_kind (u8s8). Hence the matmul falls back to gemm:jit in oneDNN. However, With this PR* the matmul goes to oneDNN+ACL which is around 10x faster than oneDNN+jit. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135058 Approved by: https://github.com/jondea, https://github.com/malfet	2024-09-12 20:30:20 +00:00
Riley Dulin	cd472bb1e3	[torch][fx] Add new replacement_callback to materialize a replacement just in time (#135553 ) Summary: Sometimes we only want to generate a replacement for a matched pattern once we know some information about the nodes in the pattern. So far, we have found this the most useful to do matches based on specific shapes of tensors flowing into functions. Use a callback function similar to `match_filters`. By default this isn't used. Had to make `replacement` a None-able parameter because Callable was already used to detect a case where a graph needed to be traced. Differential Revision: D62412628 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135553 Approved by: https://github.com/SherlockNoMad	2024-09-12 18:52:14 +00:00
Guilherme Leobas	f032135bbf	Add batching rule for torch.scatter_reduce (#135547 ) Fixes #134797 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135547 Approved by: https://github.com/zou3519	2024-09-12 18:51:21 +00:00
Joel Schlosser	525bec804c	NJT <-> padded dense conversions (#125947 ) This PR: * Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values) * Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics * Note: there is currently no public API for this; design booted to a future PR TODO: * ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~ * ~~Verify that Inductor does computation fusion via test logic~~ Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947 Approved by: https://github.com/soulitzer	2024-09-12 17:54:25 +00:00
wz337	83c594ebd6	[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725 ) Fix https://github.com/pytorch/pytorch/issues/134095 This fix distributed state dict full_state_dict option hang during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725 Approved by: https://github.com/fegin	2024-09-12 17:43:57 +00:00
Rachel Guo	c1277945d3	[AOTI][Tooling] Support debug printing for inductor level extern kernel call such as externkernel.addmm, bmm, etc. (#135731 ) Summary: As title. Effect after merging this diff would look something like this: ``` print('inductor: before_launch - triton_poi_fused_0 - buf0', buf0) triton_poi_fused_0.run(buf0, 6, grid=grid(6), stream=stream0) print('inductor: after_launch - triton_poi_fused_0 - buf0', buf0) buf1 = empty_strided_cuda((16, 6), (6, 1), torch.float32) # Topologically Sorted Source Nodes: [linear], Original ATen: [aten.addmm] print('inductor: before_launch - extern_kernels.addmm - buf0', buf0) extern_kernels.addmm(buf0, reinterpret_tensor(arg2_1, (16, 16), (16, 1), 0), reinterpret_tensor(L__self___weight, (16, 6), (1, 16), 0), alpha=1, beta=1, out=buf1) print('inductor: after_launch - extern_kernels.addmm - buf0', buf0) ``` Context: D62272588 only support major triton kernel jit inductor debug printing codegen Test Plan: CI & OSS CI Reviewed By: chenyang78, ColinPeppler Differential Revision: D62397017 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135731 Approved by: https://github.com/ColinPeppler	2024-09-12 17:31:10 +00:00
Isuru Fernando	dab7d646d5	Use a better decomposition for split_with_sizes (#135728 ) This decomposition has less checks and improves the performance of torch.compile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135728 Approved by: https://github.com/ezyang	2024-09-12 16:38:51 +00:00
whywhy-rtx3090	7647c398ff	Allow optional positional arguments for `torch.func.functional_call` (#134643 ) This PR resolves #134408. Add an additional test and have passed the local test. Do you think we should add a post-check to ensure `args` and `kwargs` are not both `None`? It seems to be possible to have modules without inputs. This PR does not include any such post-check. Pull Request resolved: https://github.com/pytorch/pytorch/pull/134643 Approved by: https://github.com/zou3519	2024-09-12 15:22:06 +00:00
Justin Chu	d67cc58181	[ONNX] Fix symbolic values and numpy implementation (#135786 ) 1. Remove `__eq__` to make `SymbolicTensor` hashable and test for that 2. Update the `__array__` method so that it works for tensor on GPU Fixes https://github.com/pytorch/pytorch/issues/135700 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135786 Approved by: https://github.com/titaiwangms	2024-09-12 14:24:43 +00:00
Animesh Jain	dddaadac6c	[dynamo] Dont graph break on inner torch.compile (#135819 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135819 Approved by: https://github.com/jansel	2024-09-12 11:39:09 +00:00
Jason Ansel	02169364e1	[inductor] Split reduction loops when there is no shared reads (#134307 ) Fixes #129102 ![image](https://github.com/user-attachments/assets/0d00f75b-2bb9-4ce6-a0d9-2daceaff539c) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134307 Approved by: https://github.com/shunting314	2024-09-12 09:45:08 +00:00
Yanbo Liang	c30042fbeb	[GPT-fast] Update compilation time target for Llama & Mixtral (#135817 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135817 Approved by: https://github.com/xmfan, https://github.com/huydhn	2024-09-12 07:13:44 +00:00
Sun, Jiayi	6700175531	[Inductor] simplify indexing_exprs in LoopBody._init_with_copy (#135574 ) This PR uses `var_ranges` information to simplify `indexing_exprs` in `LoopBody._init_with_copy` to to reduce occurrences of `FloorDiv` and `ModularIndexing` in the `indexing_exprs`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135574 Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel	2024-09-12 06:56:34 +00:00
Xilun Wu	de8a8653c0	[dtensor][BE] replace compute_local_shape with compute_local_shape_and_global_offset (#135554 ) Summary 1. This PR removes the public API `compute_local_shape` and replace its use with the more general API `compute_local_shape_and_global_offset`. 2. To keep `compute_local_shape_and_global_offset` consistent with `compute_local_shape` on empty shards, it now returns local tensor shape `(0,)` for empty shards which is more aligned with DTensor's semantics on non-participating ranks. Test `pytest test/distributed/_tensor/test_dtensor.py` `pytest test/distributed/_tensor/test_init.py` `pytest test/distributed/_tensor/test_tensor_ops.py` Differential Revision: [D62415591](https://our.internmc.facebook.com/intern/diff/D62415591) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135554 Approved by: https://github.com/tianyu-l, https://github.com/wz337	2024-09-12 06:30:09 +00:00
Jason Ansel	86335e9135	[reland 3/3][fx] Bypass custom __setattr__ in Node.__init__ (#135735 ) Relands #135079 whcih was reverted by #135562 I broke this up into three parts to test internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135735 Approved by: https://github.com/oulgen	2024-09-12 05:50:39 +00:00
angelayi	14e3f3c062	[aoti] Remove nlohmann/json.hpp from header (#135765 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135765 Approved by: https://github.com/malfet	2024-09-12 05:38:51 +00:00
Dmitry Rogozhkin	9852c6d236	xpu: fix 3rd party builds on systems with cmake<3.25 (#135767 ) Cmake LINUX variable is available on starting from cmake 3.25. Better to use CMAKE_SYSTEM_NAME instead to relax cmake version requirement. See: https://cmake.org/cmake/help/v3.25/variable/LINUX.html Fixes: #135766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135767 Approved by: https://github.com/malfet, https://github.com/guangyey	2024-09-12 05:31:01 +00:00
Jason Ansel	6354271178	[inductor] Skip unused call to get_estimated_runtime() (#135776 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135776 Approved by: https://github.com/oulgen ghstack dependencies: #135445, #135446	2024-09-12 05:22:23 +00:00
Jason Ansel	12902f6ecf	[inductor] Cache get_operation_names/get_buffer_names (#135446 ) Before: ![image](https://github.com/user-attachments/assets/db5b6fce-d849-4512-a21d-7a09efc72311) After: ![image](https://github.com/user-attachments/assets/097e340c-03b2-491e-ad36-132350b37892) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135446 Approved by: https://github.com/oulgen ghstack dependencies: #135445	2024-09-12 05:22:23 +00:00
Jason Ansel	3decb676aa	[inductor] Optimize cache_on_self (#135445 ) This is a small compile time win, but also makes profiles more readable. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135445 Approved by: https://github.com/oulgen	2024-09-12 05:22:23 +00:00
Zhenbin Lin	8d68a02905	OpenReg: Split the daemon into drvier/executor (#135646 ) Split the daemon into a proper user-process driver vs device-process executor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135646 Approved by: https://github.com/albanD	2024-09-12 05:03:46 +00:00
Jason Ansel	28330a8a39	[reland 1/3][fx] Bypass custom __setattr__ in Node.__init__ (#135733 ) Relands #135079 whcih was reverted by #135562 I broke this up into three parts to test internally. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135733 Approved by: https://github.com/oulgen	2024-09-12 04:29:37 +00:00
Animesh Jain	eaba287adb	[dynamo] Bug fix for _torchdynamo_inline source handling (#135612 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612 Approved by: https://github.com/drisspg	2024-09-12 04:05:08 +00:00
cyy	f5f1d0a753	Fix build warnings for torch_python (#134981 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134981 Approved by: https://github.com/ezyang	2024-09-12 03:59:34 +00:00
Adam J. Stewart	5bc238c73e	torch.hub: add get_dir/set_dir type hints (#134906 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/134906 Approved by: https://github.com/Skylion007	2024-09-12 03:53:29 +00:00
He Kai	79223114db	Avoid inserting extra transpose when the input to group norm is NHWC (#135575 ) When the input format for group norm is NHWC and the device is privateuseone, it introduces an additional transpose operation. To avoid this issue, a check for the privateuseone device needs to be added here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135575 Approved by: https://github.com/ezyang	2024-09-12 03:36:05 +00:00
cyy	7cfd23636c	Fix clang-tidy warnings in Caffe2 code (#134935 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134935 Approved by: https://github.com/ezyang	2024-09-12 03:27:09 +00:00
Feng Yuan	0d1d69fd25	Update torch-xpu-ops pin (ATen XPU implementation) (#135647 ) Release cycle for PyTorch 2.5 1. Fixing runtime error on Windows: Fail to load torch_xpu_ops_unary_binary_kernels.dll as the bin size is large. Pull Request resolved: https://github.com/pytorch/pytorch/pull/135647 Approved by: https://github.com/EikanWang	2024-09-12 03:16:08 +00:00
Aaron Orenstein	21a64d57b1	[BE] typing for decorators - masked/_ops (#135108 ) Differential Revision: D62184735 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135108 Approved by: https://github.com/Skylion007	2024-09-12 01:34:09 +00:00

... 5 6 7 8 9 ...

78699 Commits