Daily Source Reading: Kernel Main [OpenBSD]

Prologue and a confession

Ok, so we just had a nice stroll through the boot shire. But the real journey is just starting.

But before we dive into it, I have to admit that I lied a little bit in the previous post. Between the bootloader and our kernel main function there is a little bit more code. The kernel is mostly machine independent (MI) at its core, but a few things are still machine dependent (MD), and those have nothing to do with the bootloader. On amd64, at least, this happens in locore0.S and locore.S.

As a quick explainer, locore0.S will set up page tables, get the CPU into the right mode, just so that C code can actually run properly.

locore.S then will set up interrupt handlers, and some basic copy primitives that are really tricky to do in C without a runtime.

But those files are 2500+ lines of code and would absolutely be too long for this blog and my patience, and, most importantly, they are way above my assembly skill level.

I haven't done this kind of low level programming in more than two decades, so it will take me some time to read through those. But if there is enough interest, I will attempt to dissect them at some point in the future.

For now, this is mostly a learning exercise for myself, with the bonus-effect of readers maybe also getting something out of it. Getting that deep into the assembly of setting up page-tables and stuff will require some free time to re-learn all that.

Grab a huge bottle of water, maybe a snack, this is going to be a slightly longer one.

The main to main them all

Now that we are blissfully ignoring the very low level details (despite how actually interesting they might be), we can finally take a look at what the kernel does when it just woke up.

We are still in OpenBSD land (we will compare it later to FreeBSD, I promise), and the code is very self-explanatory. Somehow, the main function for the kernel is extremely well documented, almost to the point of being redundant.

To not force any reader to jump between the blog and a copy of the code, I will include the entire thing here (also as a snapshot, because code keeps evolving). It's not too long and we are just here to get an overview.

We will divide it up into sections and I will try my best to explain what is going on. Thanks to the comments I won't have to explain the code much and can maybe attempt to dust off the areas of my grey matter that remember some distant past of OS knowledge.

In the beginning there was nothing

/*
 * System startup; initialize the world, create process 0, mount root
 * filesystem, and fork to create init and pagedaemon.  Most of the
 * hard work is done in the lower-level initialization routines including
 * startup(), which does memory initialization and autoconfiguration.
 */
/* XXX return int, so gcc -Werror won't complain */
int
main(void *framep)
{
    struct proc *p;
    struct process *pr;
    struct pdevinit *pdev;
    extern struct pdevinit pdevinit[];
    extern void disk_init(void);

    /*
     * Initialize the current process pointer (curproc) before
     * any possible traps/probes to simplify trap processing.
     */
    curproc = p = &proc0;
    p->p_cpu = curcpu();

The initial comment says it all. It even mentions that most of the hard work is done in lower-level initialization routines such as startup(), but we will just pretend we didn't see that and imagine that we are blissfully unaware.

First, we create our 0 process. That is basically us, the kernel. It cannot be created via fork, because there is no parent. We are the parent. The prime-ancestor if you will. So we assign it the curcpu, which I'd guess comes from the machine dependent code that ran before us.

Which exact CPU, doesn't matter. The one we happen to run on. For now there is only this process, this kernel anyway. Nothing else is running.

/*
 * Initialize timeouts.
 */
timeout_startup();

/*
 * Attempt to find console and initialize
 * in case of early panic or other messages.
 */
config_init();      /* init autoconfiguration data structures */
consinit();

printf("%s\n", copyright);

So, later on we will need timers and timeouts. timeout_startup, from my understanding, only initializes and creates the structures for the timers, but doesn't run them yet.

It's the same with config_init. We are not running anything here yet, we are just preparing the structures and setting them up.

consinit on the other hand is from the MD (machine dependent) part. In the amd64 case, this function is actually just an empty stub, as machdep.c already took care of that for us.

And we finish this part by printing the copyright info.

#ifdef KUBSAN
    /* Initialize kubsan. */
    kubsan_init();
#endif

If the kernel was built with KUBSAN, the kernel undefined behavior sanitizer (one of the clang sanitizers), we also initialize the functions for that.

Witness me

WITNESS_INITIALIZE();

KERNEL_LOCK_INIT();
SCHED_LOCK_INIT();

The witness is OpenBSD's lock-order verification. And as usual, we can read about it with man 4 witness. As a tl;dr, it tries to make sure that locks are acquired in the correct order and tries to intervene if a deadlock scenario is encountered.

After that we initialize the kernel lock. From my understanding it is one big lock, parallelism be damned. It's easier to reason about and easier to implement. But from reading around the mailing lists, this lock is being pushed down as far as possible to make more kernel parts multi-processing safe.

And then we initialize the scheduler lock. I am not sure if we will ever dissect the scheduler, but maybe, we'll see.

Core subsystems

rw_obj_init();
uvm_init();
disk_init();        /* must come before autoconfiguration */
tty_init();     /* initialise tty's */
cpu_startup();

Here we are slowly getting the shape of an operating system. We initialize the core parts. We initialize read-write locks for finer-grained locking, and then we move to the big uvm, the virtual memory system. Without it, you could say goodbye to dynamic memory allocation.

Then we initialize the data for TTY (yes, we already initialized a console earlier, but this is for the general TTY system). And then we initialize and set up the cpu. This last one seems to be machine dependent again.

Chaos

random_start(boothowto & RB_GOODRANDOM);    /* Start the flow */

There's the flag we saw in the boot loader! It tells us if the loader was able to seed a good amount of entropy. With that, we go ahead and initialize the random number generator.

Hello network

    /*
     * Initialize mbuf's.  Do this now because we might attempt to
     * allocate mbufs or mbuf clusters during autoconfiguration.
     */
    mbinit();

#if NSTOEPLITZ > 0
    stoeplitz_init();
#endif

mbinit creates buffers for networking. Gotta save the incoming data somehow. And as the comment mentions, these need to be ready before we configure our network devices.

That stoeplitz thing right there I only recognized because of networking. It's a hash function that helps distribute packets across multiple cores to process them faster. But only if we actually compiled in support for it.

Plumbing pipes

/* Initialize sockets. */
soinit();

/* Initialize SRP subsystem. */
srp_startup();

/* Initialize SMR subsystem. */
smr_startup();

/*
 * Initialize process and pgrp structures.
 */
procinit();

/* Initialize file locking. */
lf_init();

/*
 * Initialize filedescriptors.
 */
filedesc_init();

/*
 * Initialize pipes.
 */
pipe_init();

/*
 * Initialize kqueues.
 */
kqueue_init();

/*
 * Initialize futexes.
 */
futex_init();
tslp_init();

I think this doesn't need much explanation. The comments here are enough and we have enough code to go through.

The main character (it's us!)

/* Create credentials. */
p->p_ucred = crget();
p->p_ucred->cr_ngroups = 1; /* group 0 */

/*
 * Create process 0 (the swapper).
 */
pr = &process0;
process_initialize(pr, p);

LIST_INSERT_HEAD(&allprocess, pr, ps_list);
LIST_INSERT_HEAD(PIDHASH(0), pr, ps_hash);
atomic_setbits_int(&pr->ps_flags, PS_SYSTEM);

/* Set the default routing table/domain. */
process0.ps_rtableid = 0;

LIST_INSERT_HEAD(&allproc, p, p_list);
pr->ps_pgrp = &pgrp0;
LIST_INSERT_HEAD(TIDHASH(0), p, p_hash);
LIST_INSERT_HEAD(PGRPHASH(0), &pgrp0, pg_hash);
LIST_INIT(&pgrp0.pg_members);
LIST_INSERT_HEAD(&pgrp0.pg_members, pr, ps_pglist);

pgrp0.pg_session = &session0;
session0.s_count = 1;
session0.s_leader = pr;

atomic_setbits_int(&p->p_flag, P_SYSTEM);
p->p_stat = SONPROC;
pr->ps_nice = NZERO;
strlcpy(pr->ps_comm, "swapper", sizeof(pr->ps_comm));

Remember how we assigned the current cpu to the current process? The current process is us, the kernel. But it's not really a real process yet because we didn't have any systems initialized.

Now we get to actually assign it a group (0).

Then comes the bookkeeping. OpenBSD has a distinction between a struct proc (a kernel thread) and a struct process (the UNIX process, which may contain multiple threads). process_initialize ties them together.

The process is inserted into every relevant list and hash table: the global process list, the PID hash, the thread ID hash, the process group hash. It's flagged as PS_SYSTEM (a system process, not killable by mere mortals).

Then we give it a name, "swapper". I believe the naming is more historic than actually having any specific requirement.

Signals and other things

/* Init timeouts. */
timeout_set(&p->p_sleep_to, endtsleep, p);

/* Initialize signal state for process 0. */
signal_init();
siginit(&sigacts0);
pr->ps_sigacts = &sigacts0;

/* Create the file descriptor table. */
p->p_fd = pr->ps_fd = fdinit();

/* Create the limits structures. */
lim_startup(&limit0);
pr->ps_limit = &limit0;

Now we give our kernel process some timeouts, which we will need when we want to go sleep for a bit. We also want to be able to use signals and have file descriptors.

We also initialize limits. Limits are things like maximum number of open files etc. By default, other processes inherit this from us.

Virtual space

/* Allocate a prototype map so we have something to fork. */
uvmspace_init(&vmspace0, pmap_kernel(), round_page(VM_MIN_ADDRESS),
    trunc_page(VM_MAX_ADDRESS), TRUE, TRUE);
p->p_vmspace = pr->ps_vmspace = &vmspace0;

p->p_addr = proc0paddr;             /* XXX */

We are not done yet. We got files, limits, signals…but we don't have any virtual memory space yet. The comment mentions it. We need this for our init process later, because we will run it using fork, and fork expects virtual memory to exist.

Counting ourselves

/*
 * Charge root for one process.
 */
(void)chgproccnt(0, 1);

Every process gets counted, for example to enforce the limit on how many processes a user may have. We are being nice here and count ourselves, charging the process to root (uid 0).

Scheduler

/* Initialize run queues */
sched_init();
sleep_queue_init();
clockqueue_init(&curcpu()->ci_queue);
sched_init_cpu(curcpu());
p->p_cpu->ci_randseed = (arc4random() & 0x7fffffff) + 1;

/* Initialize timeouts in process context. */
timeout_proc_init();

/* Initialize task queues */
taskq_init();

To run processes, we will need a scheduler. So we initialize the needed data structures (run queues, all that stuff). For now it looks like we are only initializing it for the current CPU.

We also go ahead and initialize the task queues.

Network

/* Initialize the interface/address trees */
ifinit();
softnet_init();

Now we initialize the network device data structures with ifinit. Note that we only initialize the structures, not the devices. That will come later.

We also create software interrupts for networking related stuff.

Device probing

    /* Lock the kernel on behalf of proc0. */
    KERNEL_LOCK();

#if NMPATH > 0
    /* Attach mpath before hardware */
    config_rootfound("mpath", NULL);
#endif

Right, remember that big kernel lock we initialized a bit ago? Here we acquire it, so that everything runs in order.

And before we initialize and discover any devices, we initialize the multipath (mpath) device. It's a pseudo-device that other devices can attach to. Because of that, it has to exist beforehand.

/* Configure the devices */
cpu_configure();

/* Configure virtual memory system, set vm rlimits. */
uvm_init_limits(&limit0);

/* Per CPU memory allocation */
percpu_init();

/* Reduce softnet threads to number of CPU */
softnet_percpu();

Finally, time to discover and initialize devices. The name here is a bit misleading: cpu_configure does more than just configure the CPU. From the source:

* cpu_configure() is called at boot time and initializes the vba 
* device tables and the memory controller monitoring.  Available
* devices are determined (from possibilities mentioned in ioconf.c),
* and the drivers are initialized.

Then we initialize per-CPU memory (some per-core optimizations for caching etc.), and trim down the number of networking threads to the actual number of CPUs.

Filesystem

    /* Initialize the file systems. */
#if defined(NFSSERVER) || defined(NFSCLIENT)
    nfs_init();         /* initialize server/shared data */
#endif
    vfsinit();

If we have NFS enabled, we initialize it first. Either way, we then go ahead and initialize our virtual file system. Virtual because it abstracts the actual file system away from us.

Time and sharing

    /* Start real time and statistics clocks. */
    initclocks();

#ifdef SYSVSHM
    /* Initialize System V style shared memory. */
    shminit();
#endif

#ifdef SYSVSEM
    /* Initialize System V style semaphores. */
    seminit();
#endif

#ifdef SYSVMSG
    /* Initialize System V style message queues. */
    msginit();
#endif

    /* Create default routing table before attaching lo0. */
    rtable_init();

The comments here do most of the work. We start the clocks, initialize whichever System V IPC facilities were compiled in (shared memory, semaphores, message queues), and create the default routing table.

Wanna-be devices

    /* Attach pseudo-devices. */
    for (pdev = pdevinit; pdev->pdev_attach != NULL; pdev++)
        if (pdev->pdev_count > 0)
            (*pdev->pdev_attach)(pdev->pdev_count);
#ifdef DIAGNOSTIC
    pdevinit_done = 1;
#endif

#ifdef CRYPTO
    crypto_init();
    swcr_init();
#endif /* CRYPTO */

We have initialized our devices, but now it's time for pseudo devices. These are things like lo0 loopback, or vlan devices, all that. Devices that appear to be one, but don't actually exist.

Then we also initialize the cryptographic subsystems in the kernel. This will be needed for disk encryption and fun things like that.

    /*
     * Initialize protocols.
     */
    domaininit();

    initconsbuf();

#if defined(GPROF) || defined(DDBPROF)
    /* Initialize kernel profiling. */
    prof_init();
#endif

/* Enable per-CPU data. */
mbcpuinit();
kqueue_init_percpu();
pmap_init_percpu();
uvm_init_percpu();
evcount_init_percpu();

/* init exec */
init_exec();

/* Start the scheduler */
scheduler_start();

Now that we know what our CPU layout looks like, how many cores etc, we initialize some of these data structures for each CPU. Queues, virtual memory, etc. Pretty much anything a core can safely do by itself, it gets to do with its own data and its own lock.

init_exec registers any executable format the kernel understands. ELF files, or shebang scripts.

And here comes the big one: we finally start the scheduler. So far it has only been the kernel running, not allowing for anything else to run at all. We are now ready for other processes to join the fold. The scheduler is now awake and waiting for processes to schedule.

Kernel's first baby…init

/*
 * Create process 1 (init(8)).  We do this now, as Unix has
 * historically had init be process 1, and changing this would
 * probably upset a lot of people.
 *
 * Note that process 1 won't immediately exec init(8), but will
 * wait for us to inform it that the root file system has been
 * mounted.
 */
{
    struct proc *initproc;

    if (fork1(p, FORK_FORK, start_init, NULL, NULL, &initproc))
        panic("fork init");
    initprocess = initproc->p_p;
}

Our first fork. The init system. The comment covers most of it and even explains why this is our first process.

But keep in mind that this is not running at full throttle yet. It is waiting.

/*
 * Create any kernel threads whose creation was deferred because
 * initprocess had not yet been created.
 */
kthread_run_deferred_queue();

/*
 * Now that device driver threads have been created, wait for
 * them to finish any deferred autoconfiguration.  Note we don't
 * need to lock this semaphore, since we haven't booted any
 * secondary processors, yet.
 */
while (config_pending)
    tsleep_nsec(&config_pending, PWAIT, "cfpend", INFSLP);

dostartuphooks();

I think the comments here explain it better than I could.

Root filesystem

#if NVSCSI > 0
    config_rootfound("vscsi", NULL);
#endif
#if NSOFTRAID > 0
    config_rootfound("softraid", NULL);
#endif

    /* Configure root/swap devices */
    diskconf();

#ifdef DDB
    /* Make debug symbols available in ddb. */
    db_ctf_init();
#endif

    if (mountroot == NULL || ((*mountroot)() != 0))
        panic("cannot mount root");

    TAILQ_FIRST(&mountlist)->mnt_flag |= MNT_ROOTFS;

    /* Get the vnode for '/'.  Set p->p_fd->fd_cdir to reference it. */
    if (VFS_ROOT(TAILQ_FIRST(&mountlist), &rootvnode))
        panic("cannot find root vnode");
    p->p_fd->fd_cdir = rootvnode;
    vref(p->p_fd->fd_cdir);
    VOP_UNLOCK(rootvnode);
    p->p_fd->fd_rdir = NULL;

We attach the vscsi and softraid devices if they are configured, because the root filesystem might be hiding there.

Then it happens. mountroot. If we fail to mount that, no reason to continue. Can't really have a system without a root filesystem.

We then go and grab the vnode for / and assign it as the current directory for our kernel process and hold onto a reference for it.

Init needs a bit of help

/*
 * Now that root is mounted, we can fixup initprocess's CWD
 * info.  All other processes are kthreads, which merely
 * share proc0's CWD info.
 */
initprocess->ps_fd->fd_cdir = rootvnode;
vref(initprocess->ps_fd->fd_cdir);
initprocess->ps_fd->fd_rdir = NULL;

We already forked the init process, but we did that before we even had a root filesystem. So we give it some help and give it a working directory.

We are almost there. Just a few more steps, I promise.

Swap and more

    /*
     * Now can look at time, having had a chance to verify the time
     * from the file system. 
     */
    LIST_FOREACH(pr, &allprocess, ps_list) {
        nanouptime(&pr->ps_start);
    }
    nanouptime(&curcpu()->ci_schedstate.spc_runtime);

    uvm_swap_init();

    /* Create the pageout daemon kernel thread. */
    if (kthread_create(uvm_pageout, NULL, NULL, "pagedaemon"))
        panic("fork pagedaemon");

    /* Create the reaper daemon kernel thread. */
    if (kthread_create(reaper, NULL, &reaperproc, "reaper"))
        panic("fork reaper");

    /* Create the cleaner daemon kernel thread. */
    if (kthread_create(buf_daemon, NULL, &cleanerproc, "cleaner"))
        panic("fork cleaner");

    /* Create the update daemon kernel thread. */
    if (kthread_create(syncer_thread, NULL, &syncerproc, "update"))
        panic("fork update");

    /* Create the aiodone daemon kernel thread. */ 
    if (kthread_create(uvm_aiodone_daemon, NULL, NULL, "aiodoned"))
        panic("fork aiodoned");

#if !defined(__hppa__)
    /* Create the page zeroing kernel thread. */
    if (kthread_create(uvm_pagezero_thread, NULL, NULL, "zerothread"))
        panic("fork zerothread");
#endif

We do some time checking and initialize swap.

Then the kernel spawns its army of daemon threads. These run separately, not as the main part of the kernel thread itself. Each daemon is responsible for a specific task, mostly to clean up things, so the kernel itself doesn't have to do that.

pagedaemon: the page-out daemon. When memory gets tight, this is the process that decides which pages to throw away.
reaper: cleans up dead processes. When a process exits, this thing comes along and cleans up the mess to free up file handles etc.
cleaner: flushes dirty buffers to disk.
update: periodically calls sync to flush filesystem metadata to protect against data loss on crashes. Could have been called syncer to be honest.
aiodoned: asynchronous I/O related, best guess.
zerothread: pre-zeroes free pages in the background, so that when a process needs a fresh page, it doesn't have to wait for it to be cleared out. Always good to have and can run while the pages are not needed.

SMP, or: wait, there's more than one CPU?

#if defined(MULTIPROCESSOR)
    /* Boot the secondary processors. */
    cpu_boot_secondary_processors();
#endif

    /* Now that all CPUs partake in scheduling, start SMR thread. */
    smr_startup_thread();

    config_process_deferred_mountroot();

Yes, so far we have been running on one CPU. Now that everything is set up, we can fire up all the other CPUs to get true parallelism.

Then we process any configurations left that were waiting for root to exist.

Unleash the beast

Phew…that was a lot. We have everything configured now. Our devices are there, all data structures initialized, root mounted, kernel threads are running…

/*
 * Okay, now we can let init(8) exec!  It's off to userland!
 */
start_init_exec = 1;
wakeup((void *)&start_init_exec);

It's finally time. We can wake up the init process that has been patiently waiting for us. Once it kicks off, it is off to userland and we are done booting!

Minor cleanup

    /*
     * Start the idle pool page garbage collector
     */
#if defined(MULTIPROCESSOR)
    pool_gc_pages(NULL);
#endif

    start_periodic_resettodr();

Start a garbage collector in the background that reclaims idle pages from the kernel's memory pools.

start_periodic_resettodr seems to be to sync the hardware clock with the system clock.

The void

The kernel is done booting. process0 has done its job: it gave birth to init, the scheduler and the kernel threads. It has served its purpose.

We loop forever and tell our process0 to sleep. It has nothing else to do, but has to wait around, otherwise the computer would reboot. The children it created are running the show now.

        /*
         * proc0: nothing to do, back to sleep
         */
        while (1)
                tsleep_nsec(&proc0, PVM, "scheduler", INFSLP);
    /* NOTREACHED */
}

You can actually still see it. Look here:

rootnode ~ $ doas ps -l -p 0
  UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT       TIME COMMAND
    0     0     0   1 -18   0     0     0 schedul DK     ??    0:00.73 (swapper)

There it is, sitting with the wait message we gave it.

Conclusion

Let's ignore the fact that a lot of this requires knowledge about how operating systems work. Let's focus on the code itself. The structure is clear, concise and easy to follow. 90% of it is literally just a sequence of: initialize this, initialize that, run next step…

The actual sub-systems, the queues, the scheduler, the virtual memory system, all that is hiding the interesting parts.

But I hope this de-mystified the steps you see flashing past you as white text on blue background when you boot up your OpenBSD system.

I am not sure what we will be looking at next. Maybe we will look at what FreeBSD does and compare the differences, either in semantics or just code-wise.