pytorch

mirror of https://github.com/zebrajr/pytorch.git synced 2025-12-06 00:20:18 +01:00

Georgia Phillips b229455ddd Update placement utils and weights to handle meta device (#162842)

Summary:
This diff fixes two things which come up when testing a tgif-published pt2 model remote net:

1. Updates isSameDevice to handle meta device to avoid this error:
```
what(): Unsupported device typemeta and meta
Exception raised from isSameDevice at fbcode/caffe2/torch/nativert/executor/PlacementUtils.cpp:20
```
2. Updates xl weight v2 loading logic in Weights.cpp to handle non-TBE xl-weights. Today, we enforce the device is the same for an old weight and new weight when replacing with ModelRunnerAdapter.setAttr(). However, the way we replace non-TBE xl weights is to find any weights on "meta" device and then replace them with their correct weight with real device from xl_weights folder. Therefore, the new weight and old weight will always have different devices and the device check is invalid. I don't think we've run into this so far bc non-TBE xl weights have not been thoroughly tested until now.

Test Plan:
Run MRS you model merge net, which uses non-TBE xl weights.

Confirm that before change #1 we get error:
```
Unsupported device typemeta and meta
```
Then after change #1 and before change #2 we get:
```
what(): Mismatched device for merge.user_tower.linear.weight: meta vs cpu
Exception raised from validateValue at fbcode/caffe2/torch/nativert/executor/Weights.cpp:374
```
After change run is successful

Command:
```
MODEL_ENTITY_ID=921242082
SNAPSHOT_ID=1269
module_name=merge
SAMPLE_INPUT_DIR=/data/users/georgiaphillips/models/921242082/${SNAPSHOT_ID}/${module_name}_archive/package/data/sample_inputs
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100,a100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.${module_name} --moduleName=${module_name} --submodToDevice="merge\|cuda0" --benchmarkEnableProfiling=false --disableStaticRuntime=true --doNotRandomizeSampleInputs=true --benchmarkDontRebatchSamples=true --pytorch_predictor_sigmoid_static_dispatch_enable=false --pytorch_predictor_sigmoid_graph_passes_enable=false --sampleInputFilePath=${SAMPLE_INPUT_DIR}/${module_name}.pt
```

Rollback Plan:

Differential Revision: D80713052

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162842
Approved by: https://github.com/henryoier

2025-09-17 08:12:32 +00:00
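The two checks described in the summary can be illustrated with a small, self-contained Python model. This is only a sketch under stated assumptions, not the nativert C++ code: `Device`, `is_same_device`, and `validate_replacement` are hypothetical names, and the sketch assumes the fix skips the device comparison whenever the old weight is a "meta" placeholder, which the commit message implies but does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Device:
    """Toy stand-in for a tensor device (hypothetical, not c10::Device)."""
    type: str                   # e.g. "cpu", "cuda", "meta"
    index: Optional[int] = None


def is_same_device(a: Device, b: Device) -> bool:
    """Compare two devices, treating "meta" as a supported device type."""
    if a.type != b.type:
        return False
    if a.type in ("cpu", "meta"):
        # No device index to compare for cpu or meta.
        return True
    if a.type == "cuda":
        return a.index == b.index
    raise ValueError(f"Unsupported device type {a.type} and {b.type}")


def validate_replacement(name: str, old: Device, new: Device) -> None:
    """Check a weight replacement; assumption: meta placeholders skip the check."""
    if old.type == "meta":
        # The old weight holds no real data, so the new weight is expected
        # to live on a different (real) device.
        return
    if not is_same_device(old, new):
        raise ValueError(f"Mismatched device for {name}: {old.type} vs {new.type}")


# A meta placeholder may be replaced by a cpu weight without raising.
validate_replacement("merge.user_tower.linear.weight", Device("meta"), Device("cpu"))
```

In this model, replacing a real weight with one on a different device still raises, so the existing safety check is preserved for everything except meta placeholders.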
backends	[nativert] aoti (#162353)	2025-09-12 05:56:25 +00:00
common	Revert "Fix usage of forwarding references (#161094)"	2025-09-04 20:35:41 +00:00
detail	kjt pytree registration (#161114)	2025-09-13 03:57:43 +00:00
executor	Update placement utils and weights to handle meta device (#162842)	2025-09-17 08:12:32 +00:00
graph	[PT2]: Overriding Tensor device by SubmodNameToDevice (#162144)	2025-09-16 06:56:06 +00:00
kernels	[nativert] aoti (#162353)	2025-09-12 05:56:25 +00:00
python	[nativert] Add OSS version of ModelRunner (#159268)	2025-07-29 21:08:14 +00:00
__init__.py	[nativert] aoti (#162353)	2025-09-12 05:56:25 +00:00
ModelRunner.cpp	[nativert] aoti (#162353)	2025-09-12 05:56:25 +00:00
ModelRunner.h	expose number of outputs in native runtime for unified runtime (#161723)	2025-09-04 01:20:31 +00:00
OVERVIEW.md	[BE][12/16] fix typos in torch/ (#156602)	2025-07-02 22:55:29 +00:00