[Benchmarking] Enable HF_GPT2 benchmarking on Metal (#151721)

By building the wheel with USE_DISTRIBUTED=1.

Otherwise, an attempt to run
```
python3 benchmarks/dynamo/torchbench.py --performance --only hf_T5 --backend inductor --inference --devices mps
```
will fail with
```
  File "/Users/nshulga/Library/Python/3.10/lib/python/site-packages/transformers/modeling_utils.py", line 40, in <module>
    import torch.distributed.tensor
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/__init__.py", line 4, in <module>
    import torch.distributed.tensor._ops  # force import all built-in dtensor ops
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/__init__.py", line 2, in <module>
    from ._conv_ops import *  # noqa: F403
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_ops/_conv_ops.py", line 5, in <module>
    from torch.distributed.tensor._dtensor_spec import DTensorSpec, TensorMeta
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/_dtensor_spec.py", line 6, in <module>
    from torch.distributed.tensor.placement_types import (
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/tensor/placement_types.py", line 8, in <module>
    import torch.distributed._functional_collectives as funcol
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/_functional_collectives.py", line 9, in <module>
    import torch.distributed.distributed_c10d as c10d
  File "/Users/nshulga/git/pytorch/pytorch/torch/distributed/distributed_c10d.py", line 23, in <module>
    from torch._C._distributed_c10d import (
ModuleNotFoundError: No module named 'torch._C._distributed_c10d'; 'torch._C' is not a package
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151721
Approved by: https://github.com/wdvr, https://github.com/dcci, https://github.com/huydhn
2025-12-06 12:20:52 +01:00 · 2025-04-19 02:57:03 +00:00 · 2025-04-19 02:57:03 +00:00 · 843e4d11ba
commit 843e4d11ba
parent cfc4d74b0c
2 changed files with 10 additions and 7 deletions
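
The import failure described in the commit message can be detected before launching a benchmark. The following is a minimal sketch, not part of this change: the helper name `distributed_available` is hypothetical, and it assumes that `torch.distributed.is_available()` reports `False` on a wheel built with USE_DISTRIBUTED=0. It also returns `False` when PyTorch is not installed at all.

```python
import importlib


def distributed_available() -> bool:
    """Return True only if torch.distributed can be imported and reports support.

    Hypothetical helper for illustration; not part of the PyTorch code base.
    """
    try:
        dist = importlib.import_module("torch.distributed")
    except ImportError:
        # Covers both "torch is not installed" and a broken torch.distributed import.
        return False
    is_available = getattr(dist, "is_available", None)
    if not callable(is_available):
        return False
    return bool(is_available())


if __name__ == "__main__":
    if distributed_available():
        print("torch.distributed is available")
    else:
        print("torch.distributed is unavailable; HF models that import it may fail")
```

Running the script against the wheel under test gives an early, readable signal instead of the long traceback raised from inside `transformers`.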
--- a/.ci/pytorch/macos-build.sh
+++ b/.ci/pytorch/macos-build.sh
@@ -34,11 +34,14 @@ if which sccache > /dev/null; then
 fi

 print_cmake_info
-
-# Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
-# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
-USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
-
+if [[ ${BUILD_ENVIRONMENT} == *"distributed"* ]]; then
+  # Needed for inductor benchmarks, as lots of HF networks make `torch.distribtued` calls
+  USE_DISTRIBUTED=1 USE_OPENMP=1 WERROR=1 python setup.py bdist_wheel
+else
+  # Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
+  # that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
+  USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel
+fi
 if which sccache > /dev/null; then
  print_sccache_stats
 fi
--- a/.github/workflows/inductor-perf-test-nightly-macos.yml
+++ b/.github/workflows/inductor-perf-test-nightly-macos.yml
@@ -38,7 +38,7 @@ jobs:
    uses: ./.github/workflows/_mac-build.yml
    with:
      sync-tag: macos-perf-py3-arm64-build
-      build-environment: macos-py3-arm64
+      build-environment: macos-py3-arm64-distributed
      runner-type: macos-m1-stable
      build-generates-artifacts: true
      # To match the one pre-installed in the m1 runners
@@ -54,7 +54,7 @@ jobs:
    uses: ./.github/workflows/_mac-test.yml
    needs: macos-perf-py3-arm64-build
    with:
-      build-environment: macos-py3-arm64
+      build-environment: macos-py3-arm64-distributed
      # Same as the build job
      python-version: 3.9.12
      test-matrix: ${{ needs.macos-perf-py3-arm64-build.outputs.test-matrix }}
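
The two files above cooperate through the build-environment name: the workflow passes `macos-py3-arm64-distributed`, and the build script switches on whether `BUILD_ENVIRONMENT` contains the substring `distributed`. The sketch below mirrors that selection in Python purely for illustration. The function names `uses_distributed` and `build_env_vars` are hypothetical, and the returned dictionaries simply restate the variables set on each branch of the diff.

```python
from fnmatch import fnmatchcase
from typing import Dict


def uses_distributed(build_environment: str) -> bool:
    # Mirrors the bash test [[ ${BUILD_ENVIRONMENT} == *"distributed"* ]]:
    # a case-sensitive match on the substring "distributed".
    return fnmatchcase(build_environment, "*distributed*")


def build_env_vars(build_environment: str) -> Dict[str, str]:
    """Return the environment variables the build script sets for this name."""
    if uses_distributed(build_environment):
        # Branch used by the inductor perf workflow after this change.
        return {"USE_DISTRIBUTED": "1", "USE_OPENMP": "1", "WERROR": "1"}
    # Default macOS branch, unchanged by this commit.
    return {
        "USE_DISTRIBUTED": "0",
        "USE_OPENMP": "1",
        "MACOSX_DEPLOYMENT_TARGET": "11.0",
        "WERROR": "1",
        "BUILD_TEST": "OFF",
        "USE_PYTORCH_METAL": "1",
    }


if __name__ == "__main__":
    for name in ("macos-py3-arm64", "macos-py3-arm64-distributed"):
        print(name, build_env_vars(name))
```

Because the match is a plain substring test, any build-environment name containing `distributed` selects the USE_DISTRIBUTED=1 branch, so other jobs keep the USE_DISTRIBUTED=0 coverage as long as their names avoid that substring.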