Jobs Stuck In Created State
When jobs are stuck in the created state for an extended period of time, the cause is usually one of the following:
- High Load
- Outdated Agents
- Missing Volumes
High Load
When the cluster has a high number of jobs in its queue, jobs may remain in the created state for an extended period of time. The fair share scheduler helps mitigate this when other users are the cause of the load, since it balances across users rather than across images or pipelines. If the user experiencing the stuck jobs is also the source of the heavy load, they must wait for their in-progress jobs to complete before their remaining jobs can be scheduled.
Outdated Agents
A common cause of jobs being stuck in the created state after a Thorium update is an agent that failed to update. Before the agent claims any job, it checks the API's version against its own. If the versions do not match, the agent exits without claiming a job.
Getting the Current Version
To get the current API version, run the following command:
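The exact route depends on your deployment; the /api/version path and the <THORIUM_URL> placeholder below are assumptions, not confirmed by this guide.
# assumption: substitute your deployment's URL and version route
curl -s https://<THORIUM_URL>/api/version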
Outdated Agents: Kubernetes
To determine whether the agent version is incorrect on Kubernetes, first get the pod logs with the following command:
kubectl get pods -n <NAMESPACE> | grep "/1" | awk '{print $1}' | xargs -I{} kubectl logs -n <NAMESPACE> {}
If any of the logs show the agent exiting without claiming a job due to a version mismatch, run the following command to update the Thorium agent on all nodes:
kubectl rollout restart deployment operator -n thorium
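To confirm the restart has finished rolling out, you can watch its status:
kubectl rollout status deployment operator -n thorium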
Outdated Agents: Bare Metal
On bare metal machines the agent is auto-updated by the Thorium reactor. To confirm the versions are correct, first run the following command to check the reactor:
/opt/thorium/thorium-reactor -V
Then to check the agent, run the following:
/opt/thorium/thorium-agent -V
To update the reactor, run the following command:
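The exact procedure depends on how the reactor was installed; the sketch below assumes the reactor binary lives at the path shown above and runs as a systemd service named thorium-reactor (the service name and the new-binary path are assumptions, not confirmed by this guide).
# assumption: adjust the binary source path and service name for your installation
sudo cp <PATH_TO_NEW_REACTOR_BINARY> /opt/thorium/thorium-reactor
sudo systemctl restart thorium-reactor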
To update the agent, run the following command:
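Because the reactor normally keeps the agent up to date, manually replacing the agent binary should only be needed if the auto-update fails; the binary source path below is an assumption.
# assumption: adjust the binary source path for your installation
sudo cp <PATH_TO_NEW_AGENT_BINARY> /opt/thorium/thorium-agent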
Missing Volumes
Another common issue that can cause K8s-based workers to get stuck in the created state is missing volumes. This occurs when a user has defined their image to require a volume, but that volume has not been created in K8s, leaving the pod stuck in the ContainerCreating state. To get pods in this state, run the following command:
kubectl get pods -n <NAMESPACE> | grep "ContainerCreating"
Then look to see whether any pods have been in that state for an extended period of time by checking the pod's age. For example, this pod has been stuck for 10 minutes and is likely missing a volume:
➜ ~ kubectl get pods
NAME                               READY   STATUS              RESTARTS   AGE
underwater-basketweaver-njs8smrl   0/1     ContainerCreating   0          10m
To confirm this is the issue, describe the pod with the following command and check the events:
kubectl describe pod/<POD> -n <NAMESPACE>
If there is an event similar to the following, you are missing a volume that needs to be created:
Events:
  Type     Reason       Age                    From               Message
  ----     ------       ----                   ----               -------
  Normal   Scheduled    10m45s                 default-scheduler  Successfully assigned <NAMESPACE>/<POD> to <NODE>
  Warning  FailedMount  51s (x12 over 10m45s)  kubelet            MountVolume.SetUp failed for volume "important-volume" : configmap "important-volume" not found
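In this example the missing volume is backed by a ConfigMap, so creating a ConfigMap with the expected name in the same namespace unblocks the pod. The name important-volume comes from the event above and the source file is a placeholder; if your image instead expects a Secret or PersistentVolumeClaim, create that resource type instead.
kubectl create configmap important-volume -n <NAMESPACE> --from-file=<PATH_TO_FILE>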