Intermittent jobs stuck in “Waiting for a runner to pick up this job…” while ARC keeps DesiredReplicas=0 #4423

@deanjingshui

Description


Controller Version

0.10.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

This is not reproducible on every run, and other jobs with the same runs-on configuration are picked up correctly.

Describe the bug

Summary
We’re seeing an intermittent issue where some GitHub Actions jobs remain in the UI with the status:
Waiting for a runner to pick up this job…

while our ARC runner scale set never scales up (DesiredReplicas stays at 0) and no EphemeralRunners are created in the corresponding namespace.

Most jobs in the same workflow run are handled correctly by ARC and execute on runners from the same scale set. Only a small subset of jobs (in our case two bench jobs) get stuck in this state.

Because this is not reproducible for every run, and other jobs with the same runs-on configuration are picked up correctly, this does not look like a static misconfiguration (labels, runner group, permissions, etc.). It looks more like ARC never receives the job assignment event for those specific jobs.
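One way to confirm that GitHub itself still considers the stuck jobs queued (rather than never created) is to list the run's jobs via the REST API. This is a sketch assuming an authenticated `gh` CLI; the pipe at the bottom only demonstrates the same "queued" filter over a canned payload, not real data:

```shell
# With `gh` authenticated, list jobs for the affected run that GitHub
# still reports as queued (run id taken from this issue):
#   gh api repos/web-infra-dev/rspack/actions/runs/23532375491/jobs \
#     --paginate --jq '.jobs[] | select(.status == "queued") | .name'
# Offline stand-in: apply the same "queued" filter to a canned jobs payload.
printf '%s\n' \
  '{"name": "bench", "status": "queued"}' \
  '{"name": "test", "status": "completed"}' \
  | grep '"status": "queued"'
```

If the bench jobs show up as queued here while the listener logs stay silent, that supports the "event never reached the listener" hypothesis.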

What happened
For the workflow run:
https://github.com/web-infra-dev/rspack/actions/runs/23532375491/job/68501023764
two jobs named bench show the status Waiting for a runner to pick up this job… for a long time and never start.
At the same time, on the ARC side:

# No ephemeral runners in the scale-set namespace
kubectl get ephemeralrunners -n runners-web-infra-dev-ubuntu-22-04-large
# output: No resources found in the namespace.

# EphemeralRunnerSet shows DesiredReplicas=0
kubectl get ephemeralrunnersets -n runners-web-infra-dev-ubuntu-22-04-large
NAME                             DESIREDREPLICAS   CURRENTREPLICAS   PENDING RUNNERS   RUNNING RUNNERS   FINISHED RUNNERS   DELETING RUNNERS
rspack-ubuntu-22.04-large-cdtk8  0                0                 0

Other jobs in the same workflow run (e.g. size-limit, reusable-build-*, test, etc.) are handled correctly:

  • Listener logs show Updating job info for the runner and Updating ephemeral runner with merge patch
  • Ephemeral runners are created
  • Jobs start and complete as expected

For the problematic bench jobs:

  • The GitHub Actions UI shows them as waiting for a runner
  • The ARC listener logs never mention these jobs (neither the job name nor the workflow run ID in combination with the bench job name)
  • The corresponding EphemeralRunnerSet never raises DesiredReplicas above 0
Because this only happens for some jobs, and only occasionally, we believe:

  • runs-on labels are correct (the same as for other jobs that do run)
  • Runner group permissions and Selected repositories are correct
  • ARC is generally able to scale runners and process jobs from this repository/scale set
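To rule out a missed scale-up while a job is stuck, the desired replica count can be polled directly. A sketch, assuming the `spec.replicas` field on the EphemeralRunnerSet backs the DESIREDREPLICAS column (worth confirming with `kubectl explain ephemeralrunnerset.spec`); the last line only demonstrates the extraction over a canned object:

```shell
# Against the live cluster (resource and namespace names from this issue):
#   kubectl get ephemeralrunnerset rspack-ubuntu-22.04-large-cdtk8 \
#     -n runners-web-infra-dev-ubuntu-22-04-large \
#     -o jsonpath='{.spec.replicas}'
# Offline stand-in: extract the same field from a canned object.
echo '{"spec": {"replicas": 0}}' | grep -o '"replicas": [0-9]*'
```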

Relevant listener logs (for the same workflow run)

From the listener pod:

kubectl logs -n arc-systems rspack-ubuntu-22.04-large-75697c4c-listener | grep -i 23532375491

Example excerpts (for other jobs in the same run):

2026-03-25T08:53:55Z INFO listener-app.worker.kubernetesworker Updating job info for the runner {"runnerName": "rspack-ubuntu-22.04-large-cdtk8-runner-ggghx", "ownerName": "web-infra-dev", "repoName": "rspack", "workflowRef": "web-infra-dev/rspack/.github/workflows/reusable-build-codspeed.yml@refs/pull/13446/merge", "workflowRunId": 23532375491, "jobDisplayName": "Test Linux / Codspeed-build / Codspeed-build-simulation", "requestId": 0}
2026-03-25T08:53:55Z INFO listener-app.worker.kubernetesworker Updating ephemeral runner with merge patch {"json": "{\"status\":{\"jobDisplayName\":\"Test Linux / Codspeed-build / Codspeed-build-simulation\",\"jobRepositoryName\":\"web-infra-dev/rspack\",\"jobWorkflowRef\":\"web-infra-dev/rspack/.github/workflows/reusable-build-codspeed.yml@refs/pull/13446/merge\",\"workflowRunId\":23532375491}}"}
...
2026-03-25T09:03:43Z INFO listener-app.worker.kubernetesworker Updating job info for the runner {"runnerName": "rspack-ubuntu-22.04-large-cdtk8-runner-pd6qv", "ownerName": "web-infra-dev", "repoName": "rspack", "workflowRef": "web-infra-dev/rspack/.github/workflows/reusable-build-test.yml@refs/pull/13446/merge", "workflowRunId": 23532375491, "jobDisplayName": "Test WASM / test / E2E Testing", "requestId": 0}
2026-03-25T09:03:43Z INFO listener-app.worker.kubernetesworker Updating ephemeral runner with merge patch {"json": "{\"status\":{\"jobDisplayName\":\"Test WASM / test / E2E Testing\",\"jobRepositoryName\":\"web-infra-dev/rspack\",\"jobWorkflowRef\":\"web-infra-dev/rspack/.github/workflows/reusable-build-test.yml@refs/pull/13446/merge\",\"workflowRunId\":23532375491}}"}

For the two bench jobs in the same run, there is no corresponding log line at all.
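The absence can be checked mechanically by grepping the listener logs for the bench job name (pod name as in the command above). The runnable part below is only an offline stand-in over the two excerpts quoted above:

```shell
# Against the live listener pod:
#   kubectl logs -n arc-systems rspack-ubuntu-22.04-large-75697c4c-listener \
#     | grep -i 'bench'
# Offline stand-in: only the non-bench excerpts exist, so the grep
# matches nothing and the fallback message prints.
printf '%s\n' \
  '"jobDisplayName": "Test Linux / Codspeed-build / Codspeed-build-simulation"' \
  '"jobDisplayName": "Test WASM / test / E2E Testing"' \
  | grep -i 'bench' || echo 'no bench lines'
```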

Additional observations / thoughts

  • The issue is intermittent and affects only a subset of jobs in a run
  • Because other jobs with the same runs-on and repository settings are picked up correctly, we suspect this is not a static misconfiguration (labels, runner group, permissions, etc.)
  • It looks like either:
      • the workflow_job/job-available event for these jobs is never delivered to the scale set listener; or
      • ARC drops/ignores those specific events, so it never bumps DesiredReplicas

Describe the expected behavior

  • For every job in the workflow run (including the two bench jobs), GitHub should send a workflow_job / job-available event to the scale set listener
  • ARC should:
      • increase DesiredReplicas on the corresponding EphemeralRunnerSet
      • create an ephemeral runner pod
  • The job should move from Waiting for a runner… to In progress and eventually complete (or fail for a more obvious reason)

Additional Context

githubConfigUrl: <xxx>
githubConfigSecret: <xxx>

template:
  spec:
    nodeSelector:
      kubernetes.io/os: linux
    tolerations:
    containers:
      - name: runner
        image: <xxx>
        command: ["/home/runner/run.sh"]
        resources:
          requests:
            memory: 40Gi
            cpu: 10000m

Controller Logs

https://gist.github.com/deanjingshui/676914d6c9be6b380f9b5a68ab49ae0b

Runner Pod Logs

In this case there are no runner pods to inspect. For the two bench jobs that are stuck in Waiting for a runner to pick up this job…, ARC never scales the EphemeralRunnerSet above DesiredReplicas=0, so no EphemeralRunner objects and no runner pods are created in the scale-set namespace (runners-web-infra-dev-ubuntu-22-04-large).
Because of that, there are no runner pod logs or kubectl describe output available for the affected jobs.

For reference, here is the log from a runner pod that successfully executed a job (different workflow, but same ARC setup):

kubectl logs -n runners-lynx-family-ubuntu-22-04-large \
  lynx-ubuntu-22.04-large-kjnf8-runner-fwr8h


√ Connected to GitHub
Current runner version: '2.333.0'
2026-03-25 14:56:07Z: Listening for Jobs
2026-03-25 14:56:11Z: Running job: static-check
2026-03-25 14:58:28Z: Job static-check completed with result: Succeeded
√ Removed .credentials
√ Removed .runner
Runner listener exit with 0 return code, stopping the service, no retry needed.
Exiting runner...

Metadata

Assignees

No one assigned

Labels

bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)
