Summary: In production I am seeing errors like "AttributeError: module 'triton.runtime' has no attribute 'fb_memcache'", likely due to some package skew. Until this is resolved, let's wrap this code in a try-except.
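A minimal sketch of that guard (illustrative only; the actual change wraps the corresponding call site inside PyTorch, and the names below just mirror the error message):
```python
# Hedged sketch: tolerate package skew instead of crashing in production.
try:
    import triton.runtime as triton_runtime

    fb_memcache = triton_runtime.fb_memcache  # may be missing with skewed packages
except (ImportError, AttributeError):
    fb_memcache = None  # degrade gracefully; callers must handle the None case
```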
Test Plan: CI
Differential Revision: D54604339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121340
Approved by: https://github.com/aakhundov
**Summary**
In `visualize_sharding` we chose to only print on rank 0 (global rank), which means calling `visualize_sharding` will never print anything when the dtensor object's mesh doesn't include rank 0 (i.e. a sub-mesh). This PR has `visualize_sharding` always print on the rank whose mesh coordinate is (0, 0, ..., 0) instead of the rank whose global rank is 0.
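A hedged sketch of the new condition (not the PR's exact code), using `DeviceMesh.get_coordinate()`:
```python
from torch.distributed.device_mesh import DeviceMesh

def should_print_on_this_rank(mesh: DeviceMesh) -> bool:
    # get_coordinate() returns this rank's coordinate in the mesh, or None if the
    # rank does not participate in the (sub-)mesh at all.
    coord = mesh.get_coordinate()
    return coord is not None and all(c == 0 for c in coord)
```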
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121216
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179, #120260
**Summary**
Our goal is to demonstrate DTensor's capability to represent TorchRec's parameter sharding. Currently this is done with `ShardedTensor`, and theoretically `DTensor` can replace it with minor changes.
This PR starts this effort by adding an example test that represents TorchRec's `ShardingType.ROW_WISE` using DTensor. Note that this PR only covers the even sharding case.
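As a hedged sketch of the idea (illustrative only, not the example file itself), assuming a 4-rank launch such as the torchrun command in the Test Run section below:
```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

mesh = init_device_mesh("cuda", (4,))
table = torch.randn(1024, 64)  # even case: 1024 rows divide evenly across 4 ranks
# ShardingType.ROW_WISE maps to sharding the embedding table on dim 0.
rw_table = distribute_tensor(table, mesh, placements=[Shard(0)])
```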
**Test Run**
`torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/torchrec_sharding_example.py -e row-wise`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120260
Approved by: https://github.com/wanchaol
ghstack dependencies: #121179
This PR keeps the keys in the same order as in the original state_dict, as the issue creator proposed. It also fixes a bug in how ``_metadata`` is handled (see below), along with other small changes to properly remove the prefix when it is present.
In the original code, ``_metadata`` was handled as a ``key``.
```
# also strip the prefix in metadata if any.
if "_metadata" in state_dict:
```
This is not the case: ``_metadata`` is actually an ``attribute`` of the state_dict object, not a key. Hence, the previous condition is changed to:
```
# also strip the prefix in metadata if any.
if hasattr(state_dict, "_metadata"):
```
This PR also includes the necessary test.
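Below is a hedged sketch of the corrected logic (a hypothetical helper named `strip_prefix`, not the PR's exact code): re-inserting every entry in its original iteration order preserves the key ordering, and ``_metadata`` is accessed as an attribute rather than looked up as a key.
```python
from collections import OrderedDict

def strip_prefix(state_dict: OrderedDict, prefix: str) -> None:
    # Pop and re-insert every key in iteration order so the original ordering
    # of the state_dict is preserved, stripping the prefix where present.
    for key in list(state_dict.keys()):
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        state_dict[new_key] = state_dict.pop(key)

    # _metadata lives as an attribute on the OrderedDict returned by
    # nn.Module.state_dict(), so check for it with hasattr, not `in`.
    if hasattr(state_dict, "_metadata"):
        metadata = state_dict._metadata
        for key in list(metadata.keys()):
            new_key = key[len(prefix):] if key.startswith(prefix) else key
            metadata[new_key] = metadata.pop(key)
```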
Fixes #106942
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/117464
Approved by: https://github.com/mikaylagawarecki
`Tensor.__repr__` calls functions which can perform logging, and that logging ends up formatting `self` (via `__repr__` again), causing an infinite loop. Instead of logging all the args in FakeTensor.dispatch, log the actual parameters (and use `id` to log the tensor itself).
The change to torch/testing/_internal/common_utils.py came up during testing: in some ways of running the test, `parts` was `('test', 'test_testing.py')`, so `i` was 0 and we were doing a join on `()`, which caused an error.
Repro:
```
import torch
from torch.testing import make_tensor
from torch._subclasses.fake_tensor import FakeTensor, FakeTensorMode
t = torch.sparse_coo_tensor(((0, 1), (1, 0)), (1, 2), size=(2, 2))
t2 = FakeTensor.from_tensor(t, FakeTensorMode())
print(repr(t2))
```
and run with `TORCH_LOGS=+all`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120206
Approved by: https://github.com/yanboliang, https://github.com/pearu
As titled, this PR introduces a dedicated `ParallelStyle` to shard the
nn.LayerNorm/nn.Dropout/RMSNorm layers. Before this, we were mainly using manual
distribute_module calls when sharding the RMSNorm layer, but I think we
should have a dedicated TP API to easily shard those layers, instead of
users manually working with DTensors.
I call this SequenceParallel, which might bring some confusion since we
technically "deprecated" a SequenceParallel style months ago. But this
time the SequenceParallel style is significantly different from the
previous one (which used to shard two consecutive Linear layers). I
believe getting the name right is the first priority, rather than
worrying about reusing the old name.
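A hedged usage sketch, assuming a model with a norm submodule named "norm" and an existing 1-D tensor-parallel DeviceMesh (everything other than `SequenceParallel` and `parallelize_module` is illustrative):
```python
import torch.nn as nn
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel

def shard_norm_layer(model: nn.Module, tp_mesh: DeviceMesh) -> nn.Module:
    # Shard the activations of the "norm" submodule on the sequence dimension
    # using the new SequenceParallel style instead of manual distribute_module calls.
    return parallelize_module(model, tp_mesh, {"norm": SequenceParallel()})
```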
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121295
Approved by: https://github.com/awgu, https://github.com/tianyu-l
ghstack dependencies: #121294
# Update Profiler API to collect Execution Traces
## TLDR
We would like to simplify collecting an Execution Trace and a Kineto trace together. Execution Traces and Kineto traces both provide meaningful information that can be combined to enable benchmarking, performance analysis and simulating new hardware.
```
import torch

def main():
    with torch.profiler.profile(
        activities=[
            torch.profiler.ProfilerActivity.CPU,
            torch.profiler.ProfilerActivity.CUDA,
        ],
        …
        execution_trace_observer=ExecutionTraceObserver() # <<<<<<< NEW
    ) as prof:
        ...
        prof.step()
```
See test/profiler/test_profiler.py 'test_execution_trace_with_kineto' for an example of using this API.
## What are Execution Traces?
[Chakra Execution Traces](https://github.com/mlcommons/chakra/wiki) offer a graph based representation of AI/ML workloads. It stands apart from conventional AI/ML frameworks by focusing on replay benchmarks, simulators, and emulators, prioritizing agile performance modeling and adaptable methodologies.
- Chakra is part of the MLCommons industry standard and is being adopted by other companies besides NVIDIA too.
- At Meta we have instrumented the PyPer framework to collect Execution Traces. More details on our [PyTorch implementation of Chakra can be found here](https://github.com/mlcommons/chakra/wiki).
Chakra essentially enables benchmarking and co-design for ML models without having to reproduce entire software stacks, and it helps companies collaborate [[chakra paper](https://arxiv.org/pdf/2305.14516.pdf)].
## Why correlate Execution Trace with PyTorch/Kineto Trace
Execution Traces and Kineto traces provide different types of information, and combining them is valuable. While PyTorch ETs focus on CPU operators with explicit dependencies between them, Kineto traces encode GPU operators with their start and end times. In addition, collecting them at different timestamps would be inaccurate, as several operations (NCCL, embedding lookup) are data dependent and may not match correctly.
Thus, it makes sense to collect both ET and Kineto together. The problem is that today there are two separate code paths.
## Proposal
The proposal is to modify the PyTorch profiler (Kineto) API to enable an Execution Trace to be collected simultaneously; see the TLDR section.
# Testing
Updated the unit test for collecting Kineto and Execution Trace together.
- Check that the collected ET has the right range of events.
- Compare two sets of IDs - record function IDs in the ET and external IDs in Kineto - and check that they differ by a constant offset (a sketch of this check follows the list).
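A hedged sketch of that constant-offset check (a hypothetical helper; the real assertion lives in `test_execution_trace_with_kineto`):
```python
def have_constant_offset(et_record_func_ids, kineto_external_ids):
    # The two ID sequences should line up one-to-one with a single fixed offset.
    pairs = zip(sorted(et_record_func_ids), sorted(kineto_external_ids))
    offsets = {kineto_id - et_id for et_id, kineto_id in pairs}
    return len(offsets) == 1
```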
```
pytest test/profiler/test_profiler.py -k test_execution_trace_with_kineto -rP
Running 1 items in this shard
test/profiler/test_profiler.py [W execution_trace_observer.cpp:682] Enabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[W execution_trace_observer.cpp:694] Disabling Execution Trace Observer
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:320] Completed Stage: Collection
STAGE:2024-03-05 09:05:05 1119546:1119546 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119912
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
Summary: This ROCm-specific logic makes the aten-cpu code diverge between ROCm and CUDA. This is not good because we won't be able to share aten-cpu.so between ROCm and CUDA. More specifically, it will prevent us from building aten-hip by default, which would require us to set up ROCm-specific rules, an extra burden for our build system.
Test Plan: sandcastle + oss ci
Differential Revision: D54453492
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121082
Approved by: https://github.com/jeffdaily, https://github.com/aaronenyeshi, https://github.com/albanD
Since we are already checking whether the RNG tracker is initialized, there is no real performance difference between erroring and just initializing a default RNG tracker (which we choose to be the `OffsetBasedRNGTracker`).
```
pytest test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121328
Approved by: https://github.com/wanchaol
ghstack dependencies: #120351
Fixes #121093
Previously, calling the following functions with invalid padding dimensions would cause a segmentation fault:
```
torch._C._nn.replication_pad1d, torch._C._nn.replication_pad2d, torch._C._nn.replication_pad3d
```
To fix this, condition checks were added that raise a runtime error with a descriptive message specifying the required dimensions, instead of segfaulting.
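A hedged illustration of the new behavior (not the PR's test; the exact failing inputs are in #121093):
```python
import torch

# A call whose input dimensionality doesn't match the requested padding now
# raises a RuntimeError with a descriptive message instead of crashing.
x = torch.randn(2, 3)  # too few dimensions for a 3-D replication pad
try:
    torch._C._nn.replication_pad3d(x, (1, 1, 1, 1, 1, 1))
except RuntimeError as e:
    print(e)
```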
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121298
Approved by: https://github.com/mikaylagawarecki
This PR adds initial support for meta-device initialization for pre-training without loading from a state dict. The idea is to allow `fully_shard(module)` to return and still have sharded parameters on meta device. Then, the user is free to initialize them as they please, e.g. using `to_empty()`.
We override `_apply` to achieve the following:
- Reshard the parameters to ensure that sharded parameters are registered (for correctness) -- we will always need this
- Pad new local tensors and use the padded local tensors (to handle uneven sharding) -- we will remove this once `DTensor` pads its local tensor
We use the `swap_tensors` path in `_apply`. For now, this requires setting `torch.__future__.set_swap_module_params_on_conversion(True)`; however, in the future, this may be enabled by default for wrapper subclasses and will not need any explicit API call. If requiring this call is too intrusive in the short term, we can also call it in `_apply` or when importing `fully_shard`.
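For reference, enabling the swap_tensors path mentioned above is a single call to the existing `torch.__future__` API (nothing new is added here):
```python
import torch

# Required for now so that _apply() swaps the module's parameters via swap_tensors.
torch.__future__.set_swap_module_params_on_conversion(True)
```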
```
# Pre-training flow (no checkpoint)
global_mesh = init_device_mesh(..., mesh_dim_names=("dp", "tp"))
dp_mesh, tp_mesh = global_mesh["dp"], global_mesh["tp"]
with torch.device("meta"):
    model = ...
parallelize_module(model, tp_mesh, ...)
fully_shard(model, mesh=dp_mesh, ...)
for param in model.parameters():
    assert param.device.type == "meta"
model.to_empty(device="cuda")
random.manual_seed(42, global_mesh)
for module in model.modules():
    if hasattr(module, "reset_parameters"):
        module.reset_parameters()
```
This PR includes some minor changes to allow the user to similarly cast the module to a different dtype after construction time but before forward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120351
Approved by: https://github.com/wanchaol
Without this the build will freeze with the prompt:
Proceed ([y]/n)?
I'm using rootless podman in VS Code instead of Docker, but I think that should not affect this.
...or does conda somehow detect Docker but not Podman? Anyway, this should not break anything.
Btw, I also had to uncomment the line `"remoteUser": "root"` in devcontainer.json for the post-installation to finish properly, but I guess there might be other workarounds; and perhaps you don't want to run as root if your container has root privileges.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121128
Approved by: https://github.com/drisspg
Fixes #115607
We were missing guards for the case where the grads were set to `None`. So if we compiled the optimizer with grads set to their proper values, and then again with the grads set to `None`, we'd continuously run the `None` version, because all of its guards would pass and it would be ordered before the correct version in the cache.
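A hedged repro sketch (illustrative only, not the PR's test): compile an optimizer step once with grads populated and once with grads set to `None`; the fix adds guards so the second case doesn't silently reuse the wrong cached graph.
```python
import torch

model = torch.nn.Linear(4, 4)
opt = torch.optim.Adam(model.parameters())

@torch.compile
def step():
    opt.step()

model(torch.randn(2, 4)).sum().backward()
step()                           # compiled with grads set to proper values
opt.zero_grad(set_to_none=True)
step()                           # grads are None; needs its own guarded entry
```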
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121291
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
- Adds support for custom ops backed by C++ custom autograd functions, e.g. fbgemm
- Includes files more granularly to avoid namespace pollution and circular imports
Limitations:
- Requires the user to audit their code and opt in their custom autograd::Function via autograd::Function::is_traceable, and possibly add a compiled_args + apply_with_saved implementation. This was the only way I could think of to keep this sound.
- Will throw if we can't hash the saved_data, i.e. for any non-implemented type other than list and dict in at::IValue::hash b0cfa96e82/aten/src/ATen/core/ivalue.cpp (L364)
- Can technically silently fail if both the typeid hash and the typeid string name of the custom autograd::Function collide at the same time, and an identical autograd graph containing a different custom autograd::Function with an identical implementation is called. This case seems extremely unlikely, and the only alternative to hash collisions I can think of is compiling with reflection.
- Tensors not saved via save_variables are not lifted, and are specialized on the TensorImpl*'s hash (treated as a memory address). If needed, we can lift them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120681
Approved by: https://github.com/jansel
Fix: https://github.com/pytorch/xla/issues/6009
This PR adds another case to the `TensorVariable.method_new` special case, where it
re-dispatches `new` into `new_empty`.
Since we are using fake tensors, the `new` call doesn't actually get to the corresponding
backend (e.g. XLA). So, things like the following might happen:
```python
@torch.compile(backend="openxla")
def foo(x):
    new_x = x.new(*x.size())
    # new_x.device() == "xla"
    # x.device() == "xla:0"
    return new_x + x

a = torch.arange(10)
foo(a.to(xm.xla_device()))
```
Resulting in the following error:
```python
Traceback (most recent call last):
...
File "torch/_dynamo/utils.py", line 1654, in get_fake_value
ret_val = wrap_fake_exception(
File "torch/_dynamo/utils.py", line 1190, in wrap_fake_exception
return fn()
File "torch/_dynamo/utils.py", line 1655, in <lambda>
lambda: run_node(tx.output, node, args, kwargs, nnmodule)
File "torch/_dynamo/utils.py", line 1776, in run_node
raise RuntimeError(make_error_message(e)).with_traceback(
File "torch/_dynamo/utils.py", line 1758, in run_node
return node.target(*args, **kwargs)
File "torch/utils/_stats.py", line 20, in wrapper
return fn(*args, **kwargs)
File "torch/_subclasses/fake_tensor.py", line 885, in __torch_dispatch__
return self.dispatch(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 1224, in dispatch
return self._cached_dispatch_impl(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 955, in _cached_dispatch_impl
output = self._dispatch_impl(func, types, args, kwargs)
File "torch/_subclasses/fake_tensor.py", line 1445, in _dispatch_impl
return self.wrap_meta_outputs_with_default_device_logic(
File "torch/_subclasses/fake_tensor.py", line 1575, in wrap_meta_outputs_with_default_device_logic
return tree_map(wrap, r)
File "torch/utils/_pytree.py", line 900, in tree_map
return treespec.unflatten(map(func, *flat_args))
File "torch/utils/_pytree.py", line 736, in unflatten
leaves = list(leaves)
File "torch/_subclasses/fake_tensor.py", line 1550, in wrap
) = FakeTensor._find_common_device(func, flat_args)
File "torch/_subclasses/fake_tensor.py", line 625, in _find_common_device
merge_devices(arg)
File "torch/_subclasses/fake_tensor.py", line 620, in merge_devices
raise RuntimeError(
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function add>(*(FakeTensor(..., device='xla', size=(10,), dtype=torch.int64), FakeTensor(..., device='xla:0', size=(10,), dtype=torch.int64)), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices xla, xla:0
```
Using `new_empty` instead fixes this error, because it uses the device from the source
tensor instead of inferring it from the current dispatch key set.
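A small hedged illustration of that property (runnable on CPU, independent of XLA):
```python
import torch

x = torch.arange(10)
y = x.new_empty(x.size())
# new_empty inherits the device and dtype of the source tensor, so on XLA the
# result stays on "xla:0" rather than being placed on a bare "xla" device.
assert y.device == x.device and y.dtype == x.dtype
```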
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121075
Approved by: https://github.com/jansel