mirror of
https://github.com/zebrajr/pytorch.git
synced 2025-12-06 12:20:52 +01:00
Run bazel jobs on 4xlarge (#92340)
After the previous fix to limit the CPU and memory used by Bazel, I see one case today where the runner runs out of memory in a "proper" way, with exit code 137 (0c8f4b5893). So the memory usage must be close to the limit of a 2xlarge instance. It makes sense to preemptively use 4xlarge now (like XLA).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/92340
Approved by: https://github.com/clee2000
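For readers unfamiliar with the exit code mentioned above: a status of 137 conventionally means 128 + 9, i.e. the process was terminated by SIGKILL, which is what the Linux OOM killer sends. The sketch below is illustrative only and is not part of the commit; it reproduces the status by having a child shell kill itself, and the variable names are hypothetical.

```shell
# Illustrative only: reproduce exit status 137 without actually running out of memory.
status=0
# The child shell sends SIGKILL to itself, as the kernel OOM killer would do to a build step.
sh -c 'kill -9 $$' || status=$?

# Shells report "killed by signal N" as 128 + N, so SIGKILL (9) shows up as 137.
signal=$((status - 128))
echo "exit status: $status"
echo "signal number: $signal"
```

A job that fails this way still reports its status and keeps its logs, which is the "proper" failure mode the commit message contrasts with the runner itself crashing.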
parent bb34461f00
commit b26efd0dd2
--- a/.github/workflows/_bazel-build-test.yml
+++ b/.github/workflows/_bazel-build-test.yml
@@ -26,7 +26,7 @@ jobs:
   build-and-test:
     # Don't run on forked repos.
     if: github.repository_owner == 'pytorch'
-    runs-on: [self-hosted, linux.2xlarge]
+    runs-on: [self-hosted, linux.4xlarge]
     steps:
       - name: Setup SSH (Click me for login details)
         uses: pytorch/test-infra/.github/actions/setup-ssh@main
@@ -192,14 +192,8 @@ if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
 
   get_bazel
 
-  # Bazel build is run on a linux.2xlarge (or c5.2xlarge) runner with 8 CPU and 16GB of memory.
-  # The build job sometimes fails with 'runner lost communication with the server' flaky error
-  # which indicates that something there crashes the runner process. The most likely reason is
-  # that the build runs out of memory. So trying to follow https://bazel.build/docs/user-manual
-  # to limit the CPU and memory usage, so that we will know even if it crashes with
-  # OOM error instead of crashing the runner and losing all the logs.
-  #
-  # Leave 1 CPU free and use only up to 80% of memory
+  # Leave 1 CPU free and use only up to 80% of memory to reduce the chance of crashing
+  # the runner
   BAZEL_MEM_LIMIT="--local_ram_resources=HOST_RAM*.8"
   BAZEL_CPU_LIMIT="--local_cpu_resources=HOST_CPUS-1"
 
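To make the two limits above concrete, here is a rough sketch of what `HOST_RAM*.8` and `HOST_CPUS-1` work out to on each runner size. It is not part of the commit. The 8 CPU / 16GB figures come from the removed comment; the 16 CPU / 32GB figures for linux.4xlarge are an assumption (matching a c5.4xlarge), and `bazel_budget` is a hypothetical helper name. GB is treated as 1024 MB.

```shell
# Hypothetical helper: effective Bazel budget for a host with the given
# CPU count ($1) and RAM in GB ($2).
bazel_budget() {
  cpus=$1
  ram_gb=$2
  # HOST_CPUS-1: leave one CPU free for the runner process.
  cpu_limit=$((cpus - 1))
  # HOST_RAM*.8: 80% of RAM, computed with integer math in MB.
  ram_limit_mb=$((ram_gb * 1024 * 8 / 10))
  echo "cpus=$cpu_limit ram_mb=$ram_limit_mb"
}

two_xl=$(bazel_budget 8 16)     # linux.2xlarge, figures from the removed comment
four_xl=$(bazel_budget 16 32)   # linux.4xlarge, assumed c5.4xlarge figures
echo "2xlarge: $two_xl"
echo "4xlarge: $four_xl"
```

As far as I understand, Bazel uses these values as a scheduling budget for local actions rather than as a hard cap, which would explain why a build can still be OOM-killed on the smaller instance even with the limits in place.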