From b26efd0dd27949b9b5087f646e1834fb43c91aa9 Mon Sep 17 00:00:00 2001
From: Huy Do
Date: Wed, 18 Jan 2023 20:14:56 +0000
Subject: [PATCH] Run bazel jobs on 4xlarge (#92340)

After the previous fix to limit the CPU and memory used by Bazel, I see one case today where the runner runs out of memory in a "proper" way with exit code 137 https://hud.pytorch.org/pytorch/pytorch/commit/0c8f4b58934cbfe4a52d261c914ff8b2632c4f5c. So, the memory usage must be close to limit of an 2xlarge instance. It makes sense to preemptively use 4xlarge now (like XLA)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92340
Approved by: https://github.com/clee2000
---
 .github/workflows/_bazel-build-test.yml |  2 +-
 .jenkins/pytorch/build.sh               | 10 ++--------
 2 files changed, 3 insertions(+), 9 deletions(-)

diff --git a/.github/workflows/_bazel-build-test.yml b/.github/workflows/_bazel-build-test.yml
index 036e16db060..2df7c2cd59e 100644
--- a/.github/workflows/_bazel-build-test.yml
+++ b/.github/workflows/_bazel-build-test.yml
@@ -26,7 +26,7 @@ jobs:
   build-and-test:
     # Don't run on forked repos.
     if: github.repository_owner == 'pytorch'
-    runs-on: [self-hosted, linux.2xlarge]
+    runs-on: [self-hosted, linux.4xlarge]
     steps:
       - name: Setup SSH (Click me for login details)
         uses: pytorch/test-infra/.github/actions/setup-ssh@main
diff --git a/.jenkins/pytorch/build.sh b/.jenkins/pytorch/build.sh
index 064522f5ad6..e6f76308a4f 100755
--- a/.jenkins/pytorch/build.sh
+++ b/.jenkins/pytorch/build.sh
@@ -192,14 +192,8 @@ if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
   get_bazel
-  # Bazel build is run on a linux.2xlarge (or c5.2xlarge) runner with 8 CPU and 16GB of memory.
-  # The build job sometimes fails with 'runner lost communication with the server' flaky error
-  # which indicates that something there crashes the runner process. The most likely reason is
-  # that the build runs out of memory. So trying to follow https://bazel.build/docs/user-manual
-  # to limit the memory the CPU and memory usage, so that we will know even if it crashes with
-  # OOM error instead of crashing the runner and losing all the logs.
-  #
-  # Leave 1 CPU free and use only up to 80% of memory
+  # Leave 1 CPU free and use only up to 80% of memory to reduce the change of crashing
+  # the runner
   BAZEL_MEM_LIMIT="--local_ram_resources=HOST_RAM*.8"
   BAZEL_CPU_LIMIT="--local_cpu_resources=HOST_CPUS-1"