TLDR; this PR supports exporting cond in combination with the inline_inbuilt_nn_modules flag by inlining into the tracing code in proxy_tensor.py and _symbolic_trace.py (internally, the pattern is make_fx(record_module_stack)(torch.compile(f))).
We apply two special treatments for the following cases:
1. _ModuleStackTracer wraps all nn modules into _AttrProxy. This _AttrProxy has several subtleties that make it hard to inline in dynamo, e.g. it overrides _modules with a property method and overrides `__getattr__`, which mutates captured state when called.
The solution is to unwrap the _AttrProxy and get its corresponding nn module (a 1-1 correspondence), so that dynamo symbolically traces the original nn module instead of the _AttrProxy.
2. The tracer applies a bunch of patches to `__getattr__` and `__call__` of nn.Module for tracking purposes. This doesn't work well with dynamo. The immediate error we see is `torch._dynamo.exc.Unsupported: 'inline in skipfiles: WeakKeyDictionary.__contains__ | __contains__ /home/yidi/.conda/envs/pytorch/lib/python3.10/weakref.py`, caused by a weakdict in PythonKeyTracer.
The solution is to temporarily remove these patches during dynamo symbolic convert so that dynamo has a clean environment; make_fx then traces the transformed bytecode produced by dynamo and patches the nn modules there instead.
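A minimal sketch of the pattern described above (the module, shapes, and the use of `torch.cond` are illustrative; running it end-to-end assumes this PR's changes and the `inline_inbuilt_nn_modules` config):
```
import torch
import torch._dynamo
from torch.fx.experimental.proxy_tensor import make_fx

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(4, 4)

    def forward(self, x):
        return torch.cond(x.sum() > 0, lambda x: self.lin(x), lambda x: x.sin(), (x,))

with torch._dynamo.config.patch(inline_inbuilt_nn_modules=True):
    # make_fx(record_module_stack)(torch.compile(f)) pattern
    compiled = torch.compile(M(), fullgraph=True)
    gm = make_fx(compiled, record_module_stack=True)(torch.randn(2, 4))
print(gm.graph)
```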
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133731
Approved by: https://github.com/anijain2305
ghstack dependencies: #134775
Fixes #131865. Addresses the issue seen when running the llama v3.1 8B parameter model on the MPS backend, where the batch matmul output size can exceed the 32-bit indexing limit of MPS tensors, causing an assert.
Test case to reproduce the issue with the dimensions encountered in llama v3.1 and verify this fix works around it:
```
import torch
device='mps'
a = torch.randn([32, 20064, 128], dtype=torch.float32, device=device)
b = torch.randn([32, 128, 20064], dtype=torch.float32, device=device)
res = torch.bmm(a, b)
```
Notably, the current change only works as long as each individual output matrix in the bmm does not exceed 2**32 elements; this lets us split the computation along the batch axis to stay under the limit.
Added a TORCH_CHECK to raise an error if the individual matrix dimensions are too large for this op, until a more general workaround that tiles the matmuls is available.
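For illustration only (not the actual MPS kernel change), the batch-axis split is conceptually equivalent to:
```
import torch

def bmm_in_batch_chunks(a, b, chunk_size):
    # Run bmm on slices of the batch dimension so each partial result stays
    # under the 32-bit element limit, then concatenate along the batch axis.
    outs = [torch.bmm(a[i:i + chunk_size], b[i:i + chunk_size])
            for i in range(0, a.shape[0], chunk_size)]
    return torch.cat(outs, dim=0)
```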
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133430
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656
With this API, we can mark the offending module as static in detectron2.
Today's world: user-defined nn module int attributes are considered automatic dynamic. Use the API in this PR to make them static if you want.
Alternative considered: treat all int attributes of any user-defined nn module class as static, and introduce an API `torch._dynamo.mark_nn_module_attribute_dynamic`. Defaulting to static is worrying if users have a `counter` in their model that is updated on each forward invocation.
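A hedged usage sketch, assuming the entry point is `torch._dynamo.mark_static` applied to the module; the exact API surface of this PR may differ:
```
import torch
import torch._dynamo

class Block(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.window = 8  # int attribute that would otherwise become automatic dynamic

    def forward(self, x):
        return x[..., : self.window]

mod = Block()
torch._dynamo.mark_static(mod)  # assumed entry point; marks the module's int attributes static
out = torch.compile(mod)(torch.randn(2, 16))
```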
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134713
Approved by: https://github.com/jansel
ghstack dependencies: #134653
## Motivation
This is a follow-up to PR https://github.com/pytorch/pytorch/pull/126970, adding the facility to run the content on Intel Gaudi devices.
We intend to extend similar generalization to the rest of the content in test/dynamo, which is currently written to work specifically on CUDA devices. Other devices can build on this if support is available.
## Changes
- Carve out BERT-related content into a separate class.
- Use the instantiate_device_type utility to instantiate this class for devices which support the functionality (see the sketch below).
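A hedged sketch of the instantiation pattern referred to above, assuming the standard `instantiate_device_type_tests` helper; class and test names are illustrative:
```
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestBertBenchmarks(TestCase):
    def test_bert_forward(self, device):
        # `device` is e.g. "cuda" or "hpu" depending on the instantiated variant
        ...

# Generates per-device classes (e.g. TestBertBenchmarksCUDA) only for devices
# that support the functionality.
instantiate_device_type_tests(TestBertBenchmarks, globals())

if __name__ == "__main__":
    run_tests()
```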
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130714
Approved by: https://github.com/anijain2305
Benchmarks several shapes of basic nn modules, in both eager and inductor.
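For context, a hypothetical shape of one of the benchmarked modules, with the name inferred from the log below (the actual benchmark definitions may differ):
```
import torch

class ListOfLinears(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(16, 16) for _ in range(20)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

compiled = torch.compile(ListOfLinears(), backend="inductor")
compiled(torch.randn(8, 16))
```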
```
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 48602516013
compile time instruction count for iteration 1 is 20424350269
compile time instruction count for iteration 2 is 20440350455
compile time instruction count for iteration 3 is 20419269999
compile time instruction count for iteration 4 is 20430782200
compile time instruction count for iteration 5 is 20455049622
compile time instruction count for iteration 6 is 20157290712
compile time instruction count for iteration 7 is 20455324001
compile time instruction count for iteration 8 is 20450158317
compile time instruction count for iteration 9 is 20492987748
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 961328334
compile time instruction count for iteration 1 is 958887896
compile time instruction count for iteration 2 is 958792214
compile time instruction count for iteration 3 is 958375977
compile time instruction count for iteration 4 is 958568525
compile time instruction count for iteration 5 is 958152305
compile time instruction count for iteration 6 is 959322800
compile time instruction count for iteration 7 is 958332703
compile time instruction count for iteration 8 is 958092100
compile time instruction count for iteration 9 is 958095277
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_inductor
compile time instruction count for iteration 0 is 3572145793
compile time instruction count for iteration 1 is 3503323973
compile time instruction count for iteration 2 is 3501962432
compile time instruction count for iteration 3 is 3501746084
compile time instruction count for iteration 4 is 3500687361
compile time instruction count for iteration 5 is 3822254676
compile time instruction count for iteration 6 is 3498356846
compile time instruction count for iteration 7 is 3499019157
compile time instruction count for iteration 8 is 3500780314
compile time instruction count for iteration 9 is 3500257458
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_eager
compile time instruction count for iteration 0 is 1844838754
compile time instruction count for iteration 1 is 1843476862
compile time instruction count for iteration 2 is 1844761450
compile time instruction count for iteration 3 is 1845371742
compile time instruction count for iteration 4 is 1845159665
compile time instruction count for iteration 5 is 1845035802
compile time instruction count for iteration 6 is 1844895007
compile time instruction count for iteration 7 is 1844697922
compile time instruction count for iteration 8 is 1844780885
compile time instruction count for iteration 9 is 1844493990
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_inductor
compile time instruction count for iteration 0 is 1597839479
compile time instruction count for iteration 1 is 1348225351
compile time instruction count for iteration 2 is 1347340818
compile time instruction count for iteration 3 is 1348170800
compile time instruction count for iteration 4 is 1348637747
compile time instruction count for iteration 5 is 1678366444
compile time instruction count for iteration 6 is 1348412420
compile time instruction count for iteration 7 is 1348461578
compile time instruction count for iteration 8 is 1347420149
compile time instruction count for iteration 9 is 1349748195
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_eager
compile time instruction count for iteration 0 is 137721777
compile time instruction count for iteration 1 is 139065517
compile time instruction count for iteration 2 is 137130552
compile time instruction count for iteration 3 is 137506030
compile time instruction count for iteration 4 is 137089838
compile time instruction count for iteration 5 is 137477395
compile time instruction count for iteration 6 is 138550452
compile time instruction count for iteration 7 is 137568409
compile time instruction count for iteration 8 is 136968468
compile time instruction count for iteration 9 is 137481664
collecting compile time instruction count for basic_modules_ModuleComparison_inductor
compile time instruction count for iteration 0 is 917209684
compile time instruction count for iteration 1 is 899154426
compile time instruction count for iteration 2 is 898145079
compile time instruction count for iteration 3 is 899817018
compile time instruction count for iteration 4 is 899184687
compile time instruction count for iteration 5 is 898172885
compile time instruction count for iteration 6 is 899958951
compile time instruction count for iteration 7 is 899348186
compile time instruction count for iteration 8 is 897745404
compile time instruction count for iteration 9 is 899581123
collecting compile time instruction count for basic_modules_ModuleComparison_eager
compile time instruction count for iteration 0 is 113165302
compile time instruction count for iteration 1 is 112724376
compile time instruction count for iteration 2 is 112774611
compile time instruction count for iteration 3 is 114465211
compile time instruction count for iteration 4 is 112689572
compile time instruction count for iteration 5 is 112726465
compile time instruction count for iteration 6 is 112853691
compile time instruction count for iteration 7 is 112295238
compile time instruction count for iteration 8 is 114022136
compile time instruction count for iteration 9 is 112664932
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134658
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649, #134652
**Summary**
Reland of https://github.com/pytorch/pytorch/pull/134294. Fixes #131446, #126852, #126868, #126493.
The PR was reverted due to a red CI signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294, so this PR also removes the `xfail` mark on that specific test to make the CI signal green.
See the error message below:
```
2024-08-24T13:42:01.3228990Z ==================================== RERUNS ====================================
2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3229710Z Unexpected success
2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3230407Z Unexpected success
2024-08-24T13:42:01.3230594Z =================================== FAILURES ===================================
2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3231296Z Unexpected success
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509
Approved by: https://github.com/tianyu-l, https://github.com/wz337
# Motivation
If building XPU via oneAPI 2024.2, the build fails on Windows because `sycl-preview.lib` exists there, and linking this unexpected lib results in `error LNK2019: unresolved external symbol`.
# Solution
Explicitly use `sycl-preview` in the Linux build only.
# Additional Context
For `find_library`, note that the result variable is not updated once it has been stored:
```
If the library is found the result is stored in the variable and the search will not be repeated unless the variable is cleared.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133845
Approved by: https://github.com/min-jean-cho, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
**Summary**
Fix the comment: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2313930242. For all of the cases we see in the 3 test suites (TorchBench, TIMM, HuggingFace) we expect:
* `_node` is a FX Node with target in ["index_expr", "load", "store"]
* `_node.args[1 if _node.target == "index_expr" else 2]` is another FX node with target `get_index`
* `_node.args[1 if _node.target == "index_expr" else 2].args[0]` is a str for the name of this index expression
This does not seem to hold in some FB-internal test cases, per the failure log posted in the link above. So, add a condition check to work around it (sketched below).
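A hedged sketch of the condition check described by the bullets above; the helper name and exact guards are illustrative, not the actual inductor code:
```
def _try_get_index_name(_node):
    # Return the name of the index expression only when the node matches the
    # expected pattern; otherwise return None instead of failing.
    if _node.target not in ("index_expr", "load", "store"):
        return None
    arg_pos = 1 if _node.target == "index_expr" else 2
    if arg_pos >= len(_node.args):
        return None
    get_index_node = _node.args[arg_pos]
    if getattr(get_index_node, "target", None) != "get_index":
        return None
    name = get_index_node.args[0]
    return name if isinstance(name, str) else None
```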
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134645
Approved by: https://github.com/jgong5, https://github.com/masnesral
Summary:
We found that if we init the PG in a background thread, it blocks the main thread until init is complete. This is because the pybinding never releases the GIL.
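An illustration of the scenario, assuming an environment with NCCL and the usual rendezvous env vars set (`device_id` triggers eager init); sketch only:
```
import threading
import torch
import torch.distributed as dist

def init_pg():
    # Before this fix, the pybinding held the GIL for the duration of eager
    # init, so this background initialization also blocked the main thread.
    dist.init_process_group("nccl", device_id=torch.device("cuda:0"))

t = threading.Thread(target=init_pg)
t.start()
# ... the main thread should stay responsive while the PG initializes ...
t.join()
```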
Test Plan:
existing CI on eager init
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134779
Approved by: https://github.com/c-p-i-o
This benchmark measures the cost of compiling the following function in eager and inductor.
It's basically two benchmarks.
```
@torch.compile(backend=self.backend, fullgraph=True)
def f(a, b):
result = a.clone()
for i in range(1000):
if i % 3 == 0:
result = result + b
elif i % 3 == 1:
result = result + 8 * b
else:
result = result.sin()
return result
```
PYTHONPATH=$(pwd) python benchmarks/add_loop.py out
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8286649663
compile time instruction count for iteration 1 is 2838971338
compile time instruction count for iteration 2 is 2834263023
compile time instruction count for iteration 3 is 2829447493
compile time instruction count for iteration 4 is 2830904231
compile time instruction count for iteration 5 is 2830281077
compile time instruction count for iteration 6 is 2831466595
compile time instruction count for iteration 7 is 2830732164
compile time instruction count for iteration 8 is 2831088056
compile time instruction count for iteration 9 is 2831204407
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 32585687849
compile time instruction count for iteration 1 is 11747553436
compile time instruction count for iteration 2 is 11746959875
compile time instruction count for iteration 3 is 11749479461
compile time instruction count for iteration 4 is 11750053711
compile time instruction count for iteration 5 is 11750793958
compile time instruction count for iteration 6 is 11751673576
compile time instruction count for iteration 7 is 11754552912
compile time instruction count for iteration 8 is 11753723127
compile time instruction count for iteration 9 is 11759059942
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134652
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649
We introduced the dispatchable backend for a ProcessGroup and collectives in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up that cleans up the options of a ProcessGroup and asks users to either set the timeout or backend later on, or directly create a backend after creating a PG.
Also, PGNCCL is using the Options class from ProcessGroup, but we actually should use Options from the Backend class, so this PR aligns the type and name with what we are doing on the C++ side. I don't change the signature of the public API, so it still uses args named "pg_options".
We need to change the tests to align them with this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931
Approved by: https://github.com/H-Huang
Restart the work from PR https://github.com/pytorch/pytorch/pull/100331 in this new PR, since it's hard to rebase. Expect some code to be copy/pasted from the previous PR; the main idea is the same.
Previously we saw a relatively large compilation-time increase because too many loop orders were considered. This PR continues the work by pruning and only considering loop orders that we know for sure are relevant (i.e., doing it on demand).
Some manually created cases where loop ordering matters are added as unit tests. The PR makes sure inductor does not miss fusion opportunities for them.
This PR should solve the unable-to-fuse problem in https://github.com/pytorch/pytorch/issues/130015.
Right now there is still a significant increase in compilation time, so I'll disable the feature by default. Later on, after the compilation-time issue is resolved, I'll enable it by default.
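A hedged usage sketch, assuming the feature is gated behind the inductor config flag below (flag name and example workload are assumptions):
```
import torch
import torch._inductor.config as inductor_config

# Off by default for now due to the compile-time overhead mentioned above.
inductor_config.loop_ordering_after_fusion = True

@torch.compile
def f(x):
    y = x.transpose(0, 1).contiguous()
    return y.sum(dim=-1)

f(torch.randn(1024, 2048, device="cuda"))
```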
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126254
Approved by: https://github.com/jansel
Previously, setting garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures that garbage collection and max_split freeing do not accidentally try to release expandable segments.
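For reference, the configuration combination in question can be set like this; it must happen before the CUDA caching allocator is initialized (the specific threshold and split-size values are arbitrary):
```
import os

# Combination in question: expandable segments plus GC threshold / max split
# size; previously this could trip allocator asserts under memory pressure.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "expandable_segments:True,"
    "garbage_collection_threshold:0.8,"
    "max_split_size_mb:128"
)

import torch  # import after setting the env var so the allocator picks it up
```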
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338
Approved by: https://github.com/ezyang
Fixes #133252
In strict mode, we have a routine for mapping traced parameters to their FQNs using tensor ids. Currently we assume there's at least one unique FQN for each traced parameter, but this breaks with parameter reuse when call_module nodes are present. This PR adds a test case where this breaks.
Fixes this by assigning the same FQN to all traced parameters with the same tensor id. This is fine because we return the original state_dict for the EP, and the unflattener has its own routine for handling aliasing: https://github.com/pytorch/pytorch/pull/125758
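A minimal illustration of the parameter-reuse pattern at play (hypothetical module; the real failing case is covered by the added test):
```
import torch

class Shared(torch.nn.Module):
    def __init__(self):
        super().__init__()
        lin = torch.nn.Linear(4, 4)
        self.a = lin
        self.b = lin  # the same parameters are reachable under two FQNs (a.* and b.*)

    def forward(self, x):
        return self.b(self.a(x))

ep = torch.export.export(Shared(), (torch.randn(2, 4),))
print(list(ep.state_dict.keys()))
```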
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134500
Approved by: https://github.com/angelayi
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA hit by the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.
With this fix, `torch.cuda.set_device(device)` is no longer needed to work around the IMA.
Also refactored a couple of places where the guard is created; preferably we create the guard with a known device rather than setting the device later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134345
Fix the `test_logs_out` UT on Windows, making all UTs in `test/dynamo/test_logging.py` pass on Windows.
Changes:
1. Close the `NamedTemporaryFile` to release the file handle and avoid a `PermissionError`.
2. Create the temp file with `delete=False` so it is not auto-deleted while still in use.
3. Open the log file as "utf-8" to align with Linux.
4. Handle the process-wrapping difference on Windows.
5. Delete the tmp file manually (see the sketch below).
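A hedged sketch of the Windows-friendly temp-file handling described above, using only the standard library (not the exact test code):
```
import os
import tempfile

# delete=False so the file is not auto-deleted; close the handle explicitly so
# reopening it on Windows does not hit a PermissionError.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False, encoding="utf-8")
try:
    tmp.write("example log line\n")
    tmp.close()
    with open(tmp.name, encoding="utf-8") as f:  # utf-8 to align with Linux
        contents = f.read()
finally:
    os.unlink(tmp.name)  # delete the tmp file manually
```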
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134586
Approved by: https://github.com/jansel