pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-07 12:21:27 +01:00

Author	SHA1	Message	Date
Tom Ritchford	d8c8ba2440	Fix unused Python variables in test/[e-z]* (#136964 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-12-18 23:02:30 +00:00
William Wen	d298bd840f	[dynamo] add two-point iter test (#143500 ) Implements the last checkbox for https://github.com/pytorch/pytorch/issues/112532. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143500 Approved by: https://github.com/StrongerXi	2024-12-18 22:55:46 +00:00
Nikhil Gupta	d3ff2d42c2	[ARM][feat]: Add 4 bit dynamic quantization matmuls & KleidiAI Backend (#134124 ) Description: 1. Quantize Linear Layer Weights to 4-bits: Quantize the weights of the Linear layer to 4 bits, using symmetric quantization. Pack two 4-bit weights into one uint8 container. Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32. 2. Prepare Quantized Weights, Scales, and Optional Bias: After quantizing, obtain the quantized_weights, scales, and groupsize. If the original Linear layer has a bias, prepare it as well. 3. Pack the Weights Efficiently: Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias. ```python packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features) ``` Input parameters should include: in_features and out_features (the same as the Linear layer’s corresponding parameters). 4. Perform Dynamic Quantized Matrix Multiplication: Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights. ```python output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features) ``` Inputs required include: The input tensor, packed_weights , groupsize, and the in_features and out_features. API Usage: https://github.com/pytorch/pytorch/issues/143289 Model Perf : 7B Transformer model: Prefill : 340 t/s Decode : 40 t/s 2B Transformer model Prefill : 747 t/s Decode : 80 t/s Tests: python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight Ran 1 test in 0.016s OK python test/test_linalg.py -k test__dyn_quant_matmul_4bit Ran 8 tests in 0.077s OK python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit Ran 8 tests in 11.454s Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124 Approved by: https://github.com/digantdesai, https://github.com/malfet	2024-12-18 22:30:07 +00:00
Huy Do	4717cd1ce9	Skip test_conv2d_linear_add_broadcast_shapes_cpu on fbcode (#143530 ) Summary: The test is added by D67376995 and it is failing on fbcode Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:mkldnn_pattern_matcher_cpu -- --exact 'caffe2/test/inductor:mkldnn_pattern_matcher_cpu - test_conv2d_linear_add_broadcast_shapes_cpu (caffe2.test.inductor.test_mkldnn_pattern_matcher.TestPatternMatcher)'` Differential Revision: D67413687 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143530 Approved by: https://github.com/jansel	2024-12-18 22:08:08 +00:00
James	d4ed5941db	Fix floating point literals in IRPrinter (#142119 ) Fixes #114035 This is a recreation of #140002 with approval from its author. Original description: >when v larger than 1e16, the format will be error. example: v is 1.2e17, the output is 1.2e17.f, it have two point '.' Pull Request resolved: https://github.com/pytorch/pytorch/pull/142119 Approved by: https://github.com/jgong5, https://github.com/malfet	2024-12-18 21:59:48 +00:00
Yidi Wu	10b9c5944e	[export] don't decompose custom triton op when exporting (#142426 ) For torch.export (strict and non-strict), we don't do functional decomposition. Instead, we preserve the custom triton ops as custom ops. This is because we want the exported program to be high-level and serializable. #### The alternative: If we decompose the custom op to a functional hop and make it a node in exported program, we need to figure out ways of serializing the hop and its arguments, which can be triton.jited python functions and triton dtypes. This is undesireble because: - it can be tedious to maintain layer that serialize the jited function (e.g. with a string) and dtypes. - changes to triton or the serialization logic for triton arguments can be BC breaking - exported program will expose the implementation detail (i.e. triton source code) for a specific backend (GPU) to users, which mixes levels of abstraction. #### Future plans: After this PR, in the short term, we expect users to have a seperate aot_compile stage that compiles the exported program into a Cubin file on the same machine that users call export, which does autotuning and removes triton dependency and serve the model with Cubin. This guarantees that triton changes won't break BC. In the long term, we may export multiple cubins for the triton op directly. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142426 Approved by: https://github.com/zou3519 ghstack dependencies: #142425	2024-12-18 21:36:28 +00:00
Yidi Wu	1e201422ed	[export] add is_exporting flag (#142425 ) We added an is_export flag under torch.compiler.is_exporting. This comes handy when we try to do some special logic in user-level and system-level (e.g. in upper of the stack). In increasing-scope: - `_is_fx_tracing` is set to True when we use under symbolic_trace or make_fx. - `is_exporting` is set to True when we're doing strict or non-strict export, which internally has a step that calls make_fx and set _is_fx_tracing to be True. - `is_compiling` is set to True when we're either doing strict, non-strict export or torch.compile. Pull Request resolved: https://github.com/pytorch/pytorch/pull/142425 Approved by: https://github.com/avikchaudhuri	2024-12-18 21:36:28 +00:00
Nichols A. Romero	894d47b91b	[ROCm] Fix unit test: matmul_offline_tunableop (#143322 ) Fixes #137936 The PR contains: * Fix for `matmul_offline_tunableop` * Clean-up try-finally blocks in UTs that don't use environment variables (`test_validator_tunableop_rocm`, `test_minimum_tuning_iteration_tunableop`, `test_disable_tuning_tunableop`) * Avoid the use of environment variables in `minimum_tuning_iteration_tunableop` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143322 Approved by: https://github.com/jeffdaily	2024-12-18 20:14:44 +00:00
cyy	255a977494	[1/N] Avoid const_cast (#143169 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143169 Approved by: https://github.com/albanD	2024-12-18 19:48:01 +00:00
Nikita Shulga	f129bcb5a5	[BE] Refactor argument parsing into its own function (#143395 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143395 Approved by: https://github.com/kit1980, https://github.com/atalman, https://github.com/seemethere	2024-12-18 19:42:49 +00:00
Tom Ritchford	8d4926e30a	Fix unused variables in test/torch.py (#143399 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143399 Approved by: https://github.com/albanD	2024-12-18 17:57:24 +00:00
Sun, Jiayi	863e6e4567	Improve input dimensions check for reflection_pad1d, reflection_pad2d and reflection_pad3d (#141670 ) Fix https://github.com/pytorch/pytorch/issues/141447. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141670 Approved by: https://github.com/mingfeima, https://github.com/malfet	2024-12-18 17:46:26 +00:00
Sun, Jiayi	b588a78ca3	add grad_output shape check for adaptive_max_pool2d_backward and adaptive_max_pool3d_backward (#141663 ) Fix https://github.com/pytorch/pytorch/issues/141099, https://github.com/pytorch/pytorch/issues/141100. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141663 Approved by: https://github.com/mingfeima, https://github.com/malfet	2024-12-18 17:44:27 +00:00
Mark Saroufim	93e8e32708	Remove iOS folder (#143398 ) This folder is a tutorial that is not packaged in PyTorch that's an example of how to use the now deprecated Lite Interpreter People should be using Executorch instead and there's already good documentation on it all over our tutorials and main homepage Testing to see what breaks in CI Pull Request resolved: https://github.com/pytorch/pytorch/pull/143398 Approved by: https://github.com/albanD	2024-12-18 17:25:52 +00:00
Joy Dong	ed9931e6ee	Add tests for non divisible inputs for flex decoding (#143214 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143214 Approved by: https://github.com/drisspg	2024-12-18 16:32:45 +00:00
Bin Bao	0e8013fc1c	[AOTI] Fix a typo in cpp_builder.py (#143351 ) Summary: passthough -> passthrough Pull Request resolved: https://github.com/pytorch/pytorch/pull/143351 Approved by: https://github.com/yushangdi, https://github.com/chenyang78 ghstack dependencies: #143350	2024-12-18 16:28:37 +00:00
Bin Bao	a2092665a9	[AOTI] Refactor path operations in AotCodeCompiler (#143350 ) Summary: Use safer pathlib operation instead of direct string manipulation; Update some path naming to make them more meaningful. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143350 Approved by: https://github.com/yushangdi, https://github.com/chenyang78	2024-12-18 16:28:37 +00:00
Nikita Shulga	24a18d76c8	[MPS] Use metal shaders for all view ops (#143375 ) Before this PR Metal shaders were used to scatter/gather 1-5 dimensional tensors. This PR introduces generalized ones that could be used for any dimensionality and as results gets rid of 700+ lines complex and untested code that might not even work as expected. Generalized gather shader looks as follows ```metal kernel void gather_kernel_n(uint linear_index [[thread_position_in_grid]], constant void * src_ [[buffer(0)]], device void * dst_ [[buffer(1)]], constant uint32_t * size [[buffer(2)]], constant uint32_t * stride [[buffer(3)]], constant uint32_t & numel [[buffer(4)]], constant int32_t & ndim [[buffer(5)]]) {{ if (linear_index >= numel) return; constant {0} * src = (constant {0} )src_; device {1} dst = (device {1} )dst_; uint64_t src_offs = 0; auto src_idx = linear_index; for(int dim = ndim - 1; dim >= 0; --dim) {{ src_offs += stride[dim] (src_idx % size[dim]); src_idx /= size[dim]; }} dst[linear_index] = cast<{1}>(src[src_offs]); }} ``` Which, according to the following benchmark ```python from timeit import default_timer import torch import torch.utils.cpp_extension from torch.utils.benchmark import Measurement, Timer t = Timer( stmt=f"y.copy_(x);torch.mps.synchronize()", setup=f"x=torch.rand(4, 5, 16, 64, 33, 24, dtype=torch.float32, device='mps')[:,:,:,:24,:24,];y=torch.empty(x.shape, device=x.device, dtype=x.dtype)", language="python", timer=default_timer ) print(t.blocked_autorange()) ``` Is almost twice as fast as previous implementation (i.e. on Mac Book M2 Pro it returns 2.9ms for MPS version vs 1.5ms for shader one On MacOS Sequoia [`gatherWithUpdatesTensor: indicesTensor:...`](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/gather(withupdatestensor:indicestensor:axis:batchdimensions:name:)?language=objc) crashes if invoked with complex data type, as one can see by running the code below ```swift import Metal import MetalPerformanceShadersGraph func gatherComplexMPS(device: MTLDevice, inp_buf: MTLBuffer, idx_buf: MTLBuffer, out_buf: MTLBuffer, inp_elem: Int, upd_elem: Int) { let graph = MPSGraph() let inputPlaceholder = graph.placeholder(shape: [inp_elem as NSNumber], dataType: .complexFloat32, name: nil) let indicesPlaceholder = graph.placeholder(shape: [upd_elem as NSNumber], dataType: .int64, name: nil) let outNode = graph.gather(withUpdatesTensor: inputPlaceholder, indicesTensor: indicesPlaceholder, axis: 0, batchDimensions: 0, name: nil) let mpsInputBuffer = MPSGraphTensorData(inp_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32) let mpsIndicesBuffer = MPSGraphTensorData(idx_buf, shape: [upd_elem as NSNumber], dataType: .int64) let mpsOutputBuffer = MPSGraphTensorData(out_buf, shape: [inp_elem as NSNumber], dataType: .complexFloat32) guard let queue = device.makeCommandQueue() else { fatalError("Can't make queue") } graph.run(with: queue, feeds: [inputPlaceholder: mpsInputBuffer, indicesPlaceholder: mpsIndicesBuffer ], targetOperations: nil, resultsDictionary: [outNode: mpsOutputBuffer]) } func makeBufferWithValues<T>(device: MTLDevice, values: [T]) -> MTLBuffer { guard let buf = device.makeBuffer(length: values.count * MemoryLayout<T>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } let buf_data = buf.contents().assumingMemoryBound(to: T.self) for i in 0..<values.count { buf_data[i] = values[i] } return buf } guard let device = MTLCopyAllDevices().first else { fatalError("Not Metal device found") } print("Using device \(device.name)") let inp_buf = makeBufferWithValues(device: device, values: [1.0, 2.0 , 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]) let idx_buf = makeBufferWithValues(device: device, values: [0, 1, 2, 3]) guard let out_buf = device.makeBuffer(length:8 * MemoryLayout<Float>.size, options: [.storageModeShared]) else { fatalError("Can't alloc") } gatherComplexMPS(device: device, inp_buf: inp_buf, idx_buf: idx_buf, out_buf: out_buf, inp_elem: 4, upd_elem: 4) ``` Fixes https://github.com/pytorch/pytorch/issues/143140 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143375 Approved by: https://github.com/albanD	2024-12-18 16:15:46 +00:00
FFFrog	f47aac6bc2	Make Context to be Device-agnostic Step by Step (3/N) (#137578 ) Detailed Descriptions: - Using unified Device-agnostic API to create new generator for accelerator. - Add deprecated info for GeneratorForPrivateuseone Pull Request resolved: https://github.com/pytorch/pytorch/pull/137578 Approved by: https://github.com/cyyever, https://github.com/ezyang	2024-12-18 15:12:19 +00:00
albanD	80a42399bb	Various fix for memory leak in test autograd and dataloader (#143323 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143323 Approved by: https://github.com/andrewkho, https://github.com/soulitzer ghstack dependencies: #143225	2024-12-18 13:56:59 +00:00
bobrenjc93	84b91ce4a1	remove allow-untyped-defs for torch/_inductor/test_operators.py (#143436 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143436 Approved by: https://github.com/aorenste	2024-12-18 12:54:25 +00:00
Shangdi Yu	d8ea4ce631	[reland] Kill capture_pre_autograd_graph API (#143426 ) Summary: Delete the following API: - capture_pre_autograd_graph() - capture_pre_autograd_graph_using_training_ir() - gm_using_training_ir() Update XLA pin to include https://github.com/pytorch/xla/pull/8398 There's no more call sites to `capture_pre_autograd_graph`. Except 1) two test cases in coreml, guarded by version guard, PR to remove: https://github.com/apple/coremltools/pull/2400 2) a few call sites guarded by version guard (< 2.5.0) Test Plan: CI Differential Revision: D67354440 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143426 Approved by: https://github.com/gmagogsfm	2024-12-18 12:07:09 +00:00
Zizeng Meng	eb67dd3e2d	[3/N][Memory Profiling] Add memory profiling function for MTIA hooks (#142149 ) Design Doc: https://fburl.com/gdoc/47zpuweb Prototyping: D66469341 In this diff, we implement two new mtia hooks to start/stop profiler and export the memory snapshot. In next diff, we will integrate the mtia backend with profiler python api Differential Revision: [D66823583](https://our.internmc.facebook.com/intern/diff/D66823583/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/142149 Approved by: https://github.com/nautsimon	2024-12-18 11:58:23 +00:00
Tom Ritchford	993b2f0ee0	Fix unused variables in test/test_transformers.py (#143407 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143407 Approved by: https://github.com/drisspg	2024-12-18 09:59:24 +00:00
bobrenjc93	8dd380803c	remove allow-untyped-defs for torch/_functorch/batch_norm_replacement.py (#143438 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143438 Approved by: https://github.com/oulgen	2024-12-18 09:01:06 +00:00
bobrenjc93	75fe5a3ef7	remove allow-untyped-defs for torch/fx/experimental/debug.py (#143439 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143439 Approved by: https://github.com/oulgen	2024-12-18 08:55:46 +00:00
bobrenjc93	03991798ca	remove allow-untyped-defs for torch/nn/parallel/__init__.py (#143437 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143437 Approved by: https://github.com/oulgen	2024-12-18 08:50:37 +00:00
Aidyn-A	a99536480d	[ATen][Native][Special] Hermite polynomial prematurely return NaN if n is high (#141955 ) Hermite polynomials diverge to NaN at high orders due to numerical overflow. The proposal is to prematurely return NaN of it is known that at this value it will be NaN. According to my short test ```Python import torch device = "cuda" dtype = torch.float32 x = torch.linspace(-1000, 1000, 100000, device=device, dtype=dtype) for n in range(1024): if torch.special.hermite_polynomial_h(x, n).isnan().sum().item() == x.shape[0]: print(f"hermite_polynomial_h: all outputs are nans! n = {n}") break for n in range(1024): if torch.special.hermite_polynomial_he(x, n).isnan().sum().item() == x.shape[0]: print(f"hermite_polynomial_he: all outputs are nans! n = {n}") break ``` The output values become NaNs at these orders: ``` hermite_polynomial_h: all outputs are nans! n = 53, dtype=torch.float32 hermite_polynomial_he: all outputs are nans! n = 61, dtype=torch.float32 hermite_polynomial_h: all outputs are nans! n = 272, dtype=torch.float64 hermite_polynomial_he: all outputs are nans! n = 304, dtype=torch.float64 ``` Surely, it makes sense to increase the limit as a safety margin. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141955 Approved by: https://github.com/malfet, https://github.com/eqy	2024-12-18 08:30:08 +00:00
Sheng Fu	2ea4b56ec8	Record min/max of integral tensor in ET (#143088 ) Summary: In et-replay, random data is used to run the operators. However, it does not work well for the op that uses index to access tensor. For example, embedding ops, which use the indices to look up the embedding table. If random data is used for these index ops, et-replay usually runs into invalid memory access issue. To fix it, ET provides an environment variable "ENABLE_PYTORCH_EXECUTION_TRACE_INTEGRAL_TENSOR_RANGE", if it is set, ET will capture the min/max value of the flattened integral tensor. Then in et_replay, the min/max is used to generate the random tensor within that range. It fixed invalid memory access issue. Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA.test_execution_trace_record_integral_tensor_range_cuda Differential Revision: D66666931 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143088 Approved by: https://github.com/sanrise	2024-12-18 08:20:35 +00:00
Avik Chaudhuri	bceedeec2b	fix checking non-trivial input constraints (#143442 ) A bunch of auto dynamic shape tests would fail non-strict retraceability because when checking input constraints, we'd compare non-trivial expressions, which would require / affect shape env. ``` ... is not tracked with proxy for <torch.fx.experimental.proxy_tensor._ModuleStackTracer object ... ``` I've also observed this bug internally. This PR does an early check on whether args passed have concrete shapes, and only then proceeds: as before, we 1. try to unify / solve with the arg dim when the corresponding placeholder node dim is symbolic in one symbol 2. check directly if the placeholder node dim is concrete 3. otherwise defer to run time. Differential Revision: [D67359596](https://our.internmc.facebook.com/intern/diff/D67359596/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143442 Approved by: https://github.com/tugsbayasgalan	2024-12-18 07:29:08 +00:00
qiurc	90cc43f270	Support garbage collection after pt2 compilation (#143364 ) Summary: Support garbage collection after pt2 compilation. Add jk to control the global rollout / rollback of this functionality Add env var to control individual job's rollout Test Plan: Test the model training job with / without this changes Reviewers: @yuxihu @ezyang , @Yuzhen11 , Subscribers: Tasks: Tags: Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/143364 Approved by: https://github.com/ezyang	2024-12-18 07:25:11 +00:00
Rachel Guo	9275091d6e	[provenance_tracking] Dump inductor_triton_kernel_to_post_grad_nodes.json info in debug_trace (#143055 ) Summary: This diff mainly adds code changes to dump `inductor_triton_kernel_to_post_grad_nodes.json` artifact which contains mapping info from post_grad -> inductor kernel code: `{"inductor_triton_kernel_name": [post_grad_node_0, post_grad_node_1, ..., ], "..."}.` Example paste: P1695235000 verified on the test model. See "Test Plan": We use this artifact to demonstrate provenance tracking in the frontend 3-tab highlighter tool: https://github.com/YUNQIUGUO/compiler_explorer (copy/pasted the input files for demo purpose for now and will integrate with Shangdi's tool to 4-tab) https://pxl.cl/66BzK Note: Currently only supports mapping for inductor's`TritonKernel` type. TODO for enhancing more support for `ExternKernel` and other inductor generated kernel type, etc. Test Plan: test_model_coverage.sh: ``` #!/bin/sh MODEL_ENTITY_ID=644688112 SNAPSHOT_ID=32 MODULE=merge # buck2 build --show-output mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true -c fbcode.nvcc_arch=a100,h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark TORCH_COMPILE_DEBUG=1 CUDA_VISIBLE_DEVICES=0 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCH_LOGS="+inductor, schedule, fusion, output_code" TORCH_TRACE="tmp/guorachel_tt" TORCHINDUCTOR_MAX_AUTOTUNE=1 TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 ../buck-out/v2/gen/fbcode/d29ee94b913014f1/caffe2/torch/fb/model_transform/experimental/benchmark/__mts_gpu_benchmark__/mts_gpu_benchmark.par --model-path manifold://ads_storage_fblearner/tree/user/facebook/fblearner/predictor/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/gpu_lowering/input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR_EP --gpu-trace --aot-inductor-config="{'max_autotune': True}" 2>&1 \| tee output.txt ``` {F1973765026} ``` buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:provenance_tracing -- --exact 'caffe2/test/inductor:provenance_tracing - test_triton_kernel_post_grad_mapping_aot_inductor (caffe2.test.inductor.test_provenance_tracing.TestProvenanceTracingArtifact)' ``` ``` TORCH_LOGS="+inductor, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:provenance_tracing -- -r test_triton_kernel_post_grad_mapping_aot_inductor ``` Differential Revision: D66967510 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143055 Approved by: https://github.com/chenyang78	2024-12-18 06:51:50 +00:00
Digant Desai	6829897682	Remove assert from partitioner.py (#143376 ) Remove erroneous assert assuming a dependent (user) node to be in the partition. This partially reverts #136616 by removing the assert. Tested locally with a failing ExecuTorch Arm test using ``` $ python -m examples.arm.aot_arm_compiler --model_name mv2 --target ethos-u55-128 --delegate --quantize ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143376 Approved by: https://github.com/tarun292	2024-12-18 06:08:19 +00:00
Bert Maher	6715a8858a	Triton bump for 3.2 cherry-picks (device context) (#143409 ) Summary: * https://github.com/triton-lang/triton/pull/3731 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143409 Approved by: https://github.com/atalman	2024-12-18 05:17:29 +00:00
Shangdi Yu	c17a07ade3	Add float8 support in serde schema (#143343 ) Summary: Fix https://github.com/pytorch/pytorch/issues/141316 Bump up schema minor version. as title, add float8 support in serde schema Test Plan: ``` buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_serialize_float8 ``` Differential Revision: D67307670 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143343 Approved by: https://github.com/yiming0416	2024-12-18 05:07:21 +00:00
emmettbicker	576789197a	Add support for CPU scalar in addcmul (#143264 ) Step required for performance in #143122 Adds support for CPU scalar for tensor_2 in addcmul. For example: ``` import torch a = torch.rand(2, 2, device="cuda") b = torch.tensor(1e-3) torch.add(a, b) torch.addcmul(a, a, b) # used to fail, now works ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/143264 Approved by: https://github.com/janeyx99 Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>	2024-12-18 04:43:29 +00:00
Natalia Gimelshein	859be14c4e	fix a few int64_t index computations, fix complex128 scan that had to… (#143401 ) …o few threads per title Pull Request resolved: https://github.com/pytorch/pytorch/pull/143401 Approved by: https://github.com/eqy	2024-12-18 04:27:27 +00:00
Tom Ritchford	c947a7d38e	Fix unused Python variables in test/nn (#143396 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143396 Approved by: https://github.com/mikaylagawarecki	2024-12-18 03:30:54 +00:00
bobrenjc93	17a6d4b882	remove allow-untyped-defs for torch/_export/passes/remove_runtime_assertions.py (#143435 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143435 Approved by: https://github.com/oulgen	2024-12-18 03:05:20 +00:00
Nikita Shulga	a9de6a68f4	[CD] Test that all PyTorch wheels support OpenMP (#143394 ) Together with https://github.com/pytorch/pytorch/pull/143393 fixes https://github.com/pytorch/pytorch/issues/123225 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143394 Approved by: https://github.com/atalman ghstack dependencies: #143393	2024-12-18 02:27:55 +00:00
atalman	2400db115c	Use Manylinux 2.28 for nightly build and cxx11-abi (#143423 ) As per: https://dev-discuss.pytorch.org/t/pytorch-linux-wheels-switching-to-new-wheel-build-platform-manylinux-2-28-on-november-12-2024/2581 Linux Builds: CPU, CUDA 11.8, CUDA 12.4 switched to Manylinux 2.28 and D_GLIBCXX_USE_CXX11_ABI=1 on the week of Dec 16 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143423 Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/seemethere	2024-12-18 02:02:58 +00:00
eellison	e890d67543	Use process pool for precompilation of triton templates (#142450 ) Perf results: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Tue%2C%2003%20Dec%202024%2022%3A57%3A51%20GMT&stopTime=Tue%2C%2010%20Dec%202024%2022%3A57%3A51%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/eellison/740/head&lCommit=b925256c29ec43e1933e4ede94b16d1f404b595f&rBranch=gh/eellison/740/base&rCommit=a161d6362f7d9db773322d2ce2a3a70aabbecf4b Training: <img width="793" alt="image" src="https://github.com/user-attachments/assets/75f5bc0d-8005-4213-ae88-0b94fb187dfc" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/142450 Approved by: https://github.com/jansel	2024-12-18 01:48:04 +00:00
Sun, Jiayi	c06b5048ba	[Inductor] Fix _can_be_inplace function (#143279 ) Summary: Modify _can_be_inplace function: return False if `_other.data` is an instance of `ir.BaseView`. Fix https://github.com/pytorch/pytorch/issues/143280. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143279 Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel, https://github.com/jgong5	2024-12-18 00:26:05 +00:00
Mikayla Gawarecki	6cd96f069b	Add warning to torch.jit.load (#143403 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143403 Approved by: https://github.com/albanD ghstack dependencies: #143326	2024-12-18 00:17:41 +00:00
Mikayla Gawarecki	ac8342f881	Prevent torch.jit.load path in torch.load when weights_only=True (#143326 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143326 Approved by: https://github.com/albanD	2024-12-18 00:17:41 +00:00
soulitzer	13a5c15ef5	Fix sample inputs leaked from subtest (#143415 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143415 Approved by: https://github.com/jbschlosser ghstack dependencies: #143333	2024-12-18 00:15:18 +00:00
soulitzer	3f99682fbd	NJT linear_backward should not return inner tensor as-is (#143333 ) Fixes debug=1 use-count checks https://github.com/pytorch/pytorch/actions/runs/12187808902/job/34002323481#step:22:2521 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143333 Approved by: https://github.com/jbschlosser	2024-12-18 00:15:18 +00:00
Felix Su	feb4818bc9	[SJD] adding kill logic for current process when killing a worker (#141060 ) Summary: we have seen cases where some workers don't receive stop signals, meaning watchdog isn't stopped accordingly. this diff introduces logic to kill the current pid alongside the worker pid something to note is that there is a case where the worker pid to be killed either doesn't exist or cannot be killed for some reason which will result in the current pid also not being killed. this seems okay since the watchdog loop will just attempt to kill the worker pid on the next iteration but just wanted to point this out Test Plan: experiment in next diff shows this works Differential Revision: D65837085 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141060 Approved by: https://github.com/gag1jain	2024-12-18 00:13:02 +00:00
Hyunho Yeo	efe21ee59d	[MTIA] (3/n) Implement PyTorch APIs to query/reset device peak memory usage (#143347 ) Summary: This diff implements the "max_memory_allocated" PyTorch API for MTIA devices, which returns the peak device DRAM usage Test Plan: Passed the local unit test ``` buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated ``` https://www.internalfb.com/intern/testinfra/testrun/8444249544807192 Reviewed By: yuhc, egienvalue Differential Revision: D67118173 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143347 Approved by: https://github.com/nautsimon	2024-12-17 23:37:03 +00:00
Aleksei Nikiforov	a040006da7	Force symlink creation when building python on s390x (#143195 ) Sometimes it exists already when building on s390x This change should fix docker image build on s390x. Example of error can be found here: https://github.com/pytorch/pytorch/actions/runs/12282230596/job/34365267303 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143195 Approved by: https://github.com/ezyang	2024-12-17 23:01:47 +00:00

1 2 3 4 5 ...

82517 Commits