Date: 2026-02-20

Daily Source Reading: Scheduler

The Scheduler

Right, so we have the kernel boot process down pat now. But what is a kernel without processes? And each of those processes deserves a turn to run. Therefore, we will be looking at the scheduler.

We will again look at the OpenBSD version, for three reasons:

  • It is simpler
  • The FreeBSD one is covered pretty well in Design and Implementation of the FreeBSD Operating System
  • FreeBSD's ULE scheduler is very good, but also way more complex

But never say never, maybe we will go and take a look at it at some point.

Initialize everything

Let us start at the beginning. Before the scheduler is even remotely useful, we need to set up all its structures.

/*
 * sched_init_cpu is called from main() for the boot cpu, then it's the
 * responsibility of the MD code to call it for all other cpus.
 */
void
sched_init_cpu(struct cpu_info *ci)
{
    struct schedstate_percpu *spc = &ci->ci_schedstate;
    int i;

    for (i = 0; i < SCHED_NQS; i++)
        TAILQ_INIT(&spc->spc_qs[i]);

    spc->spc_idleproc = NULL;

    clockintr_bind(&spc->spc_itimer, ci, itimer_update, NULL);
    clockintr_bind(&spc->spc_profclock, ci, profclock, NULL);
    clockintr_bind(&spc->spc_roundrobin, ci, roundrobin, NULL);
    clockintr_bind(&spc->spc_statclock, ci, statclock, NULL);

    kthread_create_deferred(sched_kthreads_create, ci);

    TAILQ_INIT(&spc->spc_deadproc);
    SIMPLEQ_INIT(&spc->spc_deferred);

    /*
     * Slight hack here until the cpuset code handles cpu_info
     * structures.
     */
    cpuset_init_cpu(ci);
}

This code ain't much magic. There are SCHED_NQS run queues per CPU, each a TAILQ, so basically a list. At the time of writing (OpenBSD 7.8), there are 32 queues. These are the priority buckets that processes sit in while waiting for CPU time.

A process will be put into a queue depending on its priority.

After that, we initialize four timers. Initializing, not starting!

And the creation of the actual kernel thread that runs the scheduler is deferred until later. It waits until we are ready to schedule.

Eric Idle

Every CPU gets a kernel thread that runs its scheduler. Once we are ready, sched_kthreads_create runs and forks off an idle process for us.

void
sched_kthreads_create(void *v)
{
    struct cpu_info *ci = v;
    struct schedstate_percpu *spc = &ci->ci_schedstate;
    static int num;

    if (fork1(&proc0, FORK_SHAREVM|FORK_SHAREFILES|FORK_NOZOMBIE|
        FORK_SYSTEM|FORK_IDLE, sched_idle, ci, NULL,
        &spc->spc_idleproc))
        panic("fork idle");

    /* Name it as specified. */
    snprintf(spc->spc_idleproc->p_p->ps_comm,
        sizeof(spc->spc_idleproc->p_p->ps_comm),
        "idle%d", num);

    num++;
}

Every time we do this, we assign a new number. So our processes end up being called idle0, idle1, etc. See how we are forking directly from proc0? We are in legit kernel land here. Told ya the scheduler is important.

But, what does this idle thing actually do? Just busy loop? If you value your electricity bill and battery life, you wouldn't want that.

void
sched_idle(void *v)
{
    struct schedstate_percpu *spc;
    struct proc *p = curproc;
    struct cpu_info *ci = v;

    KERNEL_UNLOCK();

    /*
     * The idle thread is setup in fork1(). When the CPU hatches we
     * enter here for the first time. The CPU is now ready to take
     * work and so add it to sched_all_cpus when appropriate.
     * After that just go away and properly reenter once idle.
     */
#ifdef __HAVE_CPU_TOPOLOGY
    if (sched_smt || ci->ci_smt_id == 0)
        cpuset_add(&sched_all_cpus, ci);
#else
    cpuset_add(&sched_all_cpus, ci);
#endif
    spc = &ci->ci_schedstate;

    KASSERT(ci == curcpu());
    KASSERT(curproc == spc->spc_idleproc);
    KASSERT(p->p_cpu == ci);

    SCHED_LOCK();
    p->p_stat = SSLEEP;
    mi_switch();

So, the idle process first adds the CPU to the set of CPUs able to receive work. After that we do some housekeeping checks: are we actually on the current CPU, are we the idle process for that CPU, and does the process know where it belongs.

Then we lock the scheduler and go to sleep. With mi_switch we hand control to whatever process wants to run.

We will re-enter once there is nothing to do and we can idle.

    while (1) {
        while (spc->spc_whichqs != 0) {
            struct proc *dead;

            SCHED_LOCK();
            p->p_stat = SSLEEP;
            mi_switch();

            while ((dead = TAILQ_FIRST(&spc->spc_deadproc))) {
                TAILQ_REMOVE(&spc->spc_deadproc, dead, p_runq);
                exit2(dead);
            }
        }

        splassert(IPL_NONE);

        smr_idle();

        cpuset_add(&sched_idle_cpus, ci);
        cpu_idle_enter();
        while (spc->spc_whichqs == 0) {
#ifdef MULTIPROCESSOR
            if (spc->spc_schedflags & SPCF_SHOULDHALT &&
                (spc->spc_schedflags & SPCF_HALTED) == 0) {
                cpuset_del(&sched_idle_cpus, ci);
                SCHED_LOCK();
                atomic_setbits_int(&spc->spc_schedflags,
                    spc->spc_whichqs ? 0 : SPCF_HALTED);
                SCHED_UNLOCK();
                wakeup(spc);
            }
#endif
            cpu_idle_cycle();
        }
        cpu_idle_leave();
        cpuset_del(&sched_idle_cpus, ci);
    }
}

So, idling is basically two nested loops. The inner one at the beginning (while spc_whichqs != 0) is for the case that there is still work to do. spc_whichqs is a bitmask that says "the run queue at this position still has work". And that, folks, is why we currently have 32 run queues: 32 bits is the lowest common denominator among supported platforms. It is also a sweet spot; going up to 64 queues would be too fine-grained and add management overhead.

Anyway, in that inner loop, we put ourselves to sleep, switch to the next runnable process and after that do some house-cleaning and remove dead processes.

Once there is nothing left to do, we leave that inner loop. We give SMR a chance to clean up (later post, I promise!). Then we add our CPU to the set of idle CPUs and call cpu_idle_enter. That last one may be machine dependent, as some architectures can put a core into a low-power sleep state to save extra power.

Then, while we have nothing to do (spc_whichqs == 0), we just stay in cpu_idle_cycle.

Exit stage left

When a process exits, it cannot simply free its own address space — it is still using the stack at that point. sched_exit handles this with a small trick.

/*
 * To free our address space we have to jump through a few hoops.
 * The freeing is done by the reaper, but until we have one reaper
 * per cpu, we have no way of putting this proc on the deadproc list
 * and waking up the reaper without risking having our address space and
 * stack torn from under us before we manage to switch to another proc.
 * Therefore we have a per-cpu list of dead processes where we put this
 * proc and have idle clean up that list and move it to the reaper list.
 */
void
sched_exit(struct proc *p)
{
    struct schedstate_percpu *spc = &curcpu()->ci_schedstate;

    TAILQ_INSERT_TAIL(&spc->spc_deadproc, p, p_runq);

    tuagg_add_runtime();

    KERNEL_ASSERT_LOCKED();
    sched_toidle();
}

The comment already summarizes it for us here. But I have no clue what tuagg_add_runtime does. I am too overloaded to read up on that one.

sched_toidle is also called from other places, which is why it was made into its own function.

void
sched_toidle(void)
{
    struct schedstate_percpu *spc = &curcpu()->ci_schedstate;
    struct proc *idle;

#ifdef MULTIPROCESSOR
    /* This process no longer needs to hold the kernel lock. */
    if (_kernel_lock_held())
        __mp_release_all(&kernel_lock);
#endif

    if (ISSET(spc->spc_schedflags, SPCF_ITIMER)) {
        atomic_clearbits_int(&spc->spc_schedflags, SPCF_ITIMER);
        clockintr_cancel(&spc->spc_itimer);
    }
    if (ISSET(spc->spc_schedflags, SPCF_PROFCLOCK)) {
        atomic_clearbits_int(&spc->spc_schedflags, SPCF_PROFCLOCK);
        clockintr_cancel(&spc->spc_profclock);
    }

    atomic_clearbits_int(&spc->spc_schedflags, SPCF_SWITCHCLEAR);

    SCHED_LOCK();
    idle = spc->spc_idleproc;
    idle->p_stat = SRUN;

    uvmexp.swtch++;
    if (curproc != NULL)
        TRACEPOINT(sched, off__cpu, idle->p_tid + THREAD_PID_OFFSET,
            idle->p_p->ps_pid);
    cpu_switchto(NULL, idle);
    panic("cpu_switchto returned");
}

Run Forest, run!

So far we have looked at idling processes and exiting processes. But that shit is boring, let me run something!

Well, we already know that each CPU has a set of queues for different priorities. There has to be something or someone that decides where to put a process, right? Well, you're in luck. We are gonna look at it right now. Welcome to setrunqueue. It takes a CPU, the process we wanna schedule, and a priority.

void
setrunqueue(struct cpu_info *ci, struct proc *p, uint8_t prio)
{
    struct schedstate_percpu *spc;
    int queue = prio >> 2;

    if (ci == NULL)
        ci = sched_choosecpu(p);

    KASSERT(ci != NULL);
    SCHED_ASSERT_LOCKED();
    KASSERT(p->p_wchan == NULL);
    KASSERT(!ISSET(p->p_flag, P_INSCHED));

    p->p_cpu = ci;
    p->p_stat = SRUN;
    p->p_runpri = prio;

    spc = &p->p_cpu->ci_schedstate;
    spc->spc_nrun++;
    TRACEPOINT(sched, enqueue, p->p_tid + THREAD_PID_OFFSET,
        p->p_p->ps_pid);

    TAILQ_INSERT_TAIL(&spc->spc_qs[queue], p, p_runq);
    spc->spc_whichqs |= (1U << queue);
    cpuset_add(&sched_queued_cpus, p->p_cpu);

    if (cpuset_isset(&sched_idle_cpus, p->p_cpu))
        cpu_unidle(p->p_cpu);
    else if (prio < spc->spc_curpriority)
        need_resched(ci);
}

Yeah, this is as simple as it looks. The queue we place it in is simply the priority shifted right by two, so basically divided by 4.

If no CPU is specified, we will go and choose one (hopefully one that isn't overloaded yet).

Then we assign the current CPU to the process, set it to SRUN and assign its priority.

Once the process is inserted into the queue, we set the corresponding bit in spc_whichqs for the queue we threw it in.

At the end, if the CPU we chose is idle, we kick it awake. Otherwise, if our process has a higher priority than whatever is currently running there, we tell that CPU to reschedule.

There is also a reverse of this operation: remrunqueue.

void
remrunqueue(struct proc *p)
{
    struct schedstate_percpu *spc;
    int queue = p->p_runpri >> 2;

    SCHED_ASSERT_LOCKED();
    spc = &p->p_cpu->ci_schedstate;
    spc->spc_nrun--;
    TRACEPOINT(sched, dequeue, p->p_tid + THREAD_PID_OFFSET,
        p->p_p->ps_pid);

    TAILQ_REMOVE(&spc->spc_qs[queue], p, p_runq);
    if (TAILQ_EMPTY(&spc->spc_qs[queue])) {
        spc->spc_whichqs &= ~(1U << queue);
        if (spc->spc_whichqs == 0)
            cpuset_del(&sched_queued_cpus, p->p_cpu);
    }
}

Pretty straightforward. Remove the process from the queue, clear the bit if the queue is empty now. If all the queues are empty, we remove the CPU from the set of queued CPUs.

Choosing the next process

sched_chooseproc is what we use to pick the next process to run on the current CPU.

struct proc *
sched_chooseproc(void)
{
    struct schedstate_percpu *spc = &curcpu()->ci_schedstate;
    struct proc *p;
    int queue;

    SCHED_ASSERT_LOCKED();

#ifdef MULTIPROCESSOR
    if (spc->spc_schedflags & SPCF_SHOULDHALT) {
        if (spc->spc_whichqs) {
            for (queue = 0; queue < SCHED_NQS; queue++) {
                while ((p = TAILQ_FIRST(&spc->spc_qs[queue]))) {
                    remrunqueue(p);
                    setrunqueue(NULL, p, p->p_runpri);
                    if (p->p_cpu == curcpu()) {
                        KASSERT(p->p_flag & P_CPUPEG);
                        goto again;
                    }
                }
            }
        }
        p = spc->spc_idleproc;
        if (p == NULL)
            panic("no idleproc set on CPU%d",
                CPU_INFO_UNIT(curcpu()));
        p->p_stat = SRUN;
        KASSERT(p->p_wchan == NULL);
        return (p);
    }
again:
#endif

    if (spc->spc_whichqs) {
        queue = ffs(spc->spc_whichqs) - 1;
        p = TAILQ_FIRST(&spc->spc_qs[queue]);
        remrunqueue(p);
        sched_noidle++;
        if (p->p_stat != SRUN)
            panic("thread %d not in SRUN: %d", p->p_tid, p->p_stat);
    } else if ((p = sched_steal_proc(curcpu())) == NULL) {
        p = spc->spc_idleproc;
        if (p == NULL)
            panic("no idleproc set on CPU%d",
                CPU_INFO_UNIT(curcpu()));
        p->p_stat = SRUN;
    } 

    KASSERT(p->p_wchan == NULL);
    KASSERT(!ISSET(p->p_flag, P_INSCHED));
    return (p);
}

On a halting CPU (SPCF_SHOULDHALT), we first migrate everything off to other CPUs by calling setrunqueue with NULL as the target, which lets sched_choosecpu pick a destination. CPU-pegged processes are the exception: they are not allowed to move, so sched_choosecpu hands them right back to this CPU, and we jump to again to run them normally. Once all migratable processes are gone, we return the idle process.

The normal path is straightforward: ffs (find first set bit) on spc_whichqs gives us the highest-priority non-empty queue, since lower queue numbers mean higher priority. We take the first process from it. If there is nothing at all, we try to steal a process from another CPU. If that also fails, we run idle.

This is where the queue placement by priority >> 2 pays off again: the lowest set bit always points at the most urgent work.

Choosing a CPU

Right, how do we actually go about choosing a CPU to run on? Can't be "just pick a random one". So first we look at what happens when a fresh process gets forked. We need to place it somewhere.

struct cpu_info *
sched_choosecpu_fork(struct proc *parent, int flags)
{
#ifdef MULTIPROCESSOR
    struct cpu_info *choice = NULL;
    int run, best_run = INT_MAX;
    struct cpu_info *ci;
    struct cpuset set;

#if 0
    /*
     * XXX
     * Don't do this until we have a painless way to move the cpu in exec.
     * Preferably when nuking the old pmap and getting a new one on a
     * new cpu.
     */
    /*
     * PPWAIT forks are simple. We know that the parent will not
     * run until we exec and choose another cpu, so we just steal its
     * cpu.
     */
    if (flags & FORK_PPWAIT)
        return (parent->p_cpu);
#endif

    /*
     * Look at all cpus that are currently idle and have nothing queued.
     * If there are none, pick the one with least queued procs first,
     * then the one with lowest load average.
     */
    cpuset_complement(&set, &sched_queued_cpus, &sched_idle_cpus);
    cpuset_intersection(&set, &set, &sched_all_cpus);
    if (cpuset_first(&set) == NULL)
        cpuset_copy(&set, &sched_all_cpus);

    while ((ci = cpuset_first(&set)) != NULL) {
        cpuset_del(&set, ci);

        run = ci->ci_schedstate.spc_nrun;

        if (choice == NULL || run < best_run) {
            choice = ci;
            best_run = run;
        }
    }

    if (choice != NULL)
        return (choice);
#endif
    return (curcpu());
}

Well, there you have it. I think the comment explains it better than I could.

Next up, similar function, but for the general case, not just new forks.

struct cpu_info *
sched_choosecpu(struct proc *p)
{
#ifdef MULTIPROCESSOR
    struct cpu_info *choice = NULL;
    int last_cost = INT_MAX;
    struct cpu_info *ci;
    struct cpuset set;

    /*
     * If pegged to a cpu, don't allow it to move.
     */
    if (p->p_flag & P_CPUPEG)
        return (p->p_cpu);

    sched_choose++;

    /*
     * Look at all cpus that are currently idle and have nothing queued.
     * If there are none, pick the cheapest of those.
     * (idle + queued could mean that the cpu is handling an interrupt
     * at this moment and haven't had time to leave idle yet).
     */
    cpuset_complement(&set, &sched_queued_cpus, &sched_idle_cpus);
    cpuset_intersection(&set, &set, &sched_all_cpus);

    /*
     * First, just check if our current cpu is in that set, if it is,
     * this is simple.
     * Also, our cpu might not be idle, but if it's the current cpu
     * and it has nothing else queued and we're curproc, take it.
     */
    if (cpuset_isset(&set, p->p_cpu) ||
        (p->p_cpu == curcpu() && p->p_cpu->ci_schedstate.spc_nrun == 0 &&
        (p->p_cpu->ci_schedstate.spc_schedflags & SPCF_SHOULDHALT) == 0 &&
        curproc == p)) {
        sched_wasidle++;
        return (p->p_cpu);
    }

    if (cpuset_first(&set) == NULL)
        cpuset_copy(&set, &sched_all_cpus);

    while ((ci = cpuset_first(&set)) != NULL) {
        int cost = sched_proc_to_cpu_cost(ci, p);

        if (choice == NULL || cost < last_cost) {
            choice = ci;
            last_cost = cost;
        }
        cpuset_del(&set, ci);
    }

    if (p->p_cpu != choice)
        sched_nmigrations++;
    else
        sched_nomigrations++;

    if (choice != NULL)
        return (choice);
#endif
    return (curcpu());
}

CPU-pegged processes go back to their pinned CPU immediately, no questions asked. Otherwise, we check if the process's current CPU is idle and unqueued. In that case, staying put on that CPU is good enough. If not, we compute a cost for each candidate CPU and pick the cheapest. We will look at that cost estimation later.

At the end, we handle a few counters, but those are for statistics.

Steal work away

Here's an interesting case. If a CPU has nothing to do, it will look around at other CPUs' run queues and take work away from them.

Basically a "hey I'm free, give me something".

/*
 * Attempt to steal a proc from some cpu.
 */
struct proc *
sched_steal_proc(struct cpu_info *self)
{
    struct proc *best = NULL;
#ifdef MULTIPROCESSOR
    struct schedstate_percpu *spc;
    int bestcost = INT_MAX;
    struct cpu_info *ci;
    struct cpuset set;

    KASSERT((self->ci_schedstate.spc_schedflags & SPCF_SHOULDHALT) == 0);

    /* Don't steal if we don't want to schedule processes in this CPU. */
    if (!cpuset_isset(&sched_all_cpus, self))
        return (NULL);

A halting CPU does not steal. It is going offline and should not be stealing any work to begin with. Similarly, if this CPU is not in sched_all_cpus, we do not steal.

    cpuset_copy(&set, &sched_queued_cpus);

    while ((ci = cpuset_first(&set)) != NULL) {
        struct proc *p;
        int queue;
        int cost;

        cpuset_del(&set, ci);

        spc = &ci->ci_schedstate;

        queue = ffs(spc->spc_whichqs) - 1;
        TAILQ_FOREACH(p, &spc->spc_qs[queue], p_runq) {
            if (p->p_flag & P_CPUPEG)
                continue;

            cost = sched_proc_to_cpu_cost(self, p);

            if (best == NULL || cost < bestcost) {
                best = p;
                bestcost = cost;
            }
        }
    }
    if (best == NULL)
        return (NULL);

    TRACEPOINT(sched, steal, best->p_tid + THREAD_PID_OFFSET,
        best->p_p->ps_pid, CPU_INFO_UNIT(self));

    remrunqueue(best);
    best->p_cpu = self;

    sched_stolen++;
#endif
    return (best);
}

Otherwise we walk every queued CPU, look at the highest-priority queue on each one, and apply the same cost function to find the best candidate to migrate over. CPU-pegged processes are skipped. The winner gets yanked off its current queue and reassigned to our CPU.

The mythical cost function

This one definitely deserves attention. This is the function that decides whether it's worth migrating a process to a new CPU. In essence, it comes down to three factors. Remember, lower cost means a better place to move to.

int
sched_proc_to_cpu_cost(struct cpu_info *ci, struct proc *p)
{
    int cost = 0;
#ifdef MULTIPROCESSOR
    struct schedstate_percpu *spc;
    int l2resident = 0;

    spc = &ci->ci_schedstate;

    /*
     * First, account for the priority of the proc we want to move.
     * More willing to move, the lower the priority of the destination
     * and the higher the priority of the proc.
     */
    if (!cpuset_isset(&sched_idle_cpus, ci)) {
        cost += (p->p_usrpri - spc->spc_curpriority) *
            sched_cost_priority;
        cost += sched_cost_runnable;
    }
    if (cpuset_isset(&sched_queued_cpus, ci))
        cost += spc->spc_nrun * sched_cost_runnable;

Cost 1: if the target CPU is not idle, we pay for the priority gap between our process and whatever is running on the destination, scaled by a factor that is a bit of a wild guess. And if the target already has processes queued, each of those adds to the cost as well.

    /*
     * Try to avoid the primary cpu as it handles hardware interrupts.
     *
     * XXX Needs to be revisited when we distribute interrupts
     * over cpus.
     */
    if (CPU_IS_PRIMARY(ci))
        cost += sched_cost_runnable;

Cost 2: the comment explains it all, I guess.

    /*
     * If the proc is on this cpu already, lower the cost by how much
     * it has been running and an estimate of its footprint.
     */
    if (p->p_cpu == ci && p->p_slptime == 0) {
        l2resident =
            log2(pmap_resident_count(p->p_vmspace->vm_map.pmap));
        cost -= l2resident * sched_cost_resident;
    }
#endif
    return (cost);
}

Cost 3: same here, the comment sums it up pretty nicely.

So, now we know what makes the scheduler decide where to move a process, if it is needed and worth it. The scaling values are a bit of a best guess, because it is very difficult to hard-science that. But I'd say it has been working well enough for a long time.

Dafuq is pegging?

No, not that kind of pegging. Get your mind out of the gutter.

Sometimes we want a process to run on a specific CPU, for example during CPU hotplug or certain synchronisation operations.

/*
 * Peg a proc to a cpu.
 */
void
sched_peg_curproc(struct cpu_info *ci)
{
    struct proc *p = curproc;

    SCHED_LOCK();
    atomic_setbits_int(&p->p_flag, P_CPUPEG);
    setrunqueue(ci, p, p->p_usrpri);
    p->p_ru.ru_nvcsw++;
    mi_switch();
}

To peg a process to a CPU, we set the flag, add it to the queue and yield. From now on, it will be pegged to that CPU.

void
sched_unpeg_curproc(void)
{
    struct proc *p = curproc;

    atomic_clearbits_int(&p->p_flag, P_CPUPEG);
}

The reverse is literally just clearing the flag.

Related housekeeping: when secondary CPUs are brought (back) up, sched_start_secondary_cpus clears their halt flags and adds them to sched_all_cpus again.

void
sched_start_secondary_cpus(void)
{
    CPU_INFO_ITERATOR cii;
    struct cpu_info *ci;

    CPU_INFO_FOREACH(cii, ci) {
        struct schedstate_percpu *spc = &ci->ci_schedstate;

        if (CPU_IS_PRIMARY(ci) || !CPU_IS_RUNNING(ci))
            continue;
        atomic_clearbits_int(&spc->spc_schedflags,
            SPCF_SHOULDHALT | SPCF_HALTED);
#ifdef __HAVE_CPU_TOPOLOGY
        if (!sched_smt && ci->ci_smt_id > 0)
            continue;
#endif
        cpuset_add(&sched_all_cpus, ci);
    }
}

Hyperthreading

int
sysctl_hwsmt(void *oldp, size_t *oldlenp, void *newp, size_t newlen)
{
    CPU_INFO_ITERATOR cii;
    struct cpu_info *ci;
    int err, newsmt;

    newsmt = sched_smt;
    err = sysctl_int_bounded(oldp, oldlenp, newp, newlen, &newsmt, 0, 1);
    if (err)
        return err;
    if (newsmt == sched_smt)
        return 0;

    sched_smt = newsmt;
    CPU_INFO_FOREACH(cii, ci) {
        if (CPU_IS_PRIMARY(ci) || !CPU_IS_RUNNING(ci))
            continue;
        if (ci->ci_smt_id == 0)
            continue;
        if (sched_smt)
            cpuset_add(&sched_all_cpus, ci);
        else
            cpuset_del(&sched_all_cpus, ci);
    }

    return 0;
}

OpenBSD added the ability to enable hyperthreading. By default it is disabled, as a means to mitigate some attack surfaces. Toggling hw.smt at runtime simply adds or removes the non-primary SMT siblings from sched_all_cpus. Once removed, sched_choosecpu will never pick them and work stealing will ignore them. The primary hardware thread per core (ci_smt_id == 0) is always left alone.

Conclusion

The OpenBSD scheduler is minimal, readable and does its job. No elaborate feedback mechanisms, no dynamic priority adjustment happening here. Just a well-reasoned set of data structures and a handful of policy knobs.

The FreeBSD ULE scheduler does a lot more, taking into account things like how processes have behaved in the past. But it clocks in at 3000+ lines, so it would have to be broken down into several posts, or we would have to skip some parts.

If there is interest, let me know over on Mastodon and I will try to summarize it more succinctly than the book did.