Run bazel jobs on 4xlarge (#92340)

After the previous fix to limit the CPU and memory used by Bazel, I see one case today where the runner runs out of memory in a "proper" way with exit code 137 (0c8f4b5893). So, the memory usage must be close to the limit of a 2xlarge instance. It makes sense to preemptively use 4xlarge now (like XLA).
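As a side note on why 137 signals OOM: by POSIX convention a process killed by a signal exits with 128 plus the signal number, and SIGKILL (9) is what the kernel OOM killer sends. A quick sketch to decode it (the arithmetic here is my own illustration, not part of the change):

```shell
# Exit code 137 = 128 + 9, i.e. the process was terminated by SIGKILL,
# which the kernel OOM killer delivers when memory is exhausted.
code=137
signal=$(( code - 128 ))
echo "signal ${signal}: SIG$(kill -l ${signal})"
```

So an exit code of 137, unlike a lost runner, at least tells us the build itself was OOM-killed.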
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92340
Approved by: https://github.com/clee2000
This commit is contained in:
Huy Do 2023-01-18 20:14:56 +00:00 committed by PyTorch MergeBot
parent bb34461f00
commit b26efd0dd2
2 changed files with 3 additions and 9 deletions


@@ -26,7 +26,7 @@ jobs:
build-and-test:
# Don't run on forked repos.
if: github.repository_owner == 'pytorch'
-    runs-on: [self-hosted, linux.2xlarge]
+    runs-on: [self-hosted, linux.4xlarge]
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main


@@ -192,14 +192,8 @@ if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
get_bazel
-  # Bazel build is run on a linux.2xlarge (or c5.2xlarge) runner with 8 CPU and 16GB of memory.
-  # The build job sometimes fails with 'runner lost communication with the server' flaky error
-  # which indicates that something there crashes the runner process. The most likely reason is
-  # that the build runs out of memory. So trying to follow https://bazel.build/docs/user-manual
-  # to limit the CPU and memory usage, so that we will know even if it crashes with
-  # OOM error instead of crashing the runner and losing all the logs.
-  #
-  # Leave 1 CPU free and use only up to 80% of memory
+  # Leave 1 CPU free and use only up to 80% of memory to reduce the chance of crashing
+  # the runner
BAZEL_MEM_LIMIT="--local_ram_resources=HOST_RAM*.8"
BAZEL_CPU_LIMIT="--local_cpu_resources=HOST_CPUS-1"
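For reference, `--local_ram_resources` and `--local_cpu_resources` are Bazel flags that cap how much of the host's RAM and CPUs the local execution phase may schedule against; `HOST_RAM*.8` and `HOST_CPUS-1` are expressions Bazel evaluates at startup. A minimal sketch of how these variables would be expanded into a build invocation (the `//:target` label is hypothetical, and `echo` stands in for actually running Bazel):

```shell
# Flag values taken from the diff above; Bazel resolves HOST_RAM and
# HOST_CPUS itself, so they are passed through literally.
BAZEL_MEM_LIMIT="--local_ram_resources=HOST_RAM*.8"
BAZEL_CPU_LIMIT="--local_cpu_resources=HOST_CPUS-1"

# echo instead of executing, since this is only an illustration:
echo bazel build "${BAZEL_MEM_LIMIT}" "${BAZEL_CPU_LIMIT}" //:target
```

Unquoted, `HOST_RAM*.8` would be subject to shell glob expansion, which is why the variables are expanded inside double quotes.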