Daily Source Reading: Kernel Main [FreeBSD]

Looking at the other side

Yesterday we looked at OpenBSD's main function. Today, we will look at how FreeBSD does it.

A different kind of startup

It took a bit longer to dissect and jump around, but yeah, FreeBSD has a very different approach (at least in style) to starting the kernel. We will see very early on why.

/*
 * System startup; initialize the world, create process 0, mount root
 * filesystem, and fork to create init and pagedaemon.  Most of the
 * hard work is done in the lower-level initialization routines including
 * startup(), which does memory initialization and autoconfiguration.
 *
 * This allows simple addition of new kernel subsystems that require
 * boot time initialization.  It also allows substitution of subsystem
 * (for instance, a scheduler, kernel profiler, or VM system) by object
 * module.  Finally, it allows for optional "kernel threads".
 */
void
mi_startup(void)
{
    struct sysinit *sip;
    int last;
#if defined(VERBOSE_SYSINIT)
    int verbose;
#endif

    TSENTER();

    if (boothowto & RB_VERBOSE)
        bootverbose++;

Run of the mill start of a function, just included for completeness. TSENTER here is merely for time logging if it is enabled.

/* Construct and sort sysinit list. */
sysinit_mklist(&sysinit_list, SET_BEGIN(sysinit_set), SET_LIMIT(sysinit_set));

Wait what? Remember all these different steps that the OpenBSD kernel took? Initializing virtual memory, setting up the scheduler, etc? They were kind of hardcoded. Easy to read and find, but less modular.

FreeBSD here takes a slightly different approach and treats these as subsystems. So we create a list of them. The sysinit_set is the collection of systems that are compiled into the kernel. More can be added later, for example by a kld (a loadable kernel module), so keep that in mind.

The kernel has two macros sprinkled around, SYSINIT and C_SYSINIT (one for functions that take non-const arguments and one for functions that take const arguments). These add items to that set.

Each system has a priority and ordering (and sub-systems within it have an ordering). We will look at a few examples later, so that it'll make sense.
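To make the ordering idea a bit more concrete, here is a minimal userland sketch, not the kernel's code. The struct toy_sysinit, the function names and the numeric IDs used with it are all made up for illustration; the only thing it tries to show is "sort by subsystem first, then by order within the subsystem".

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's sysinit record. */
struct toy_sysinit {
    unsigned int subsystem;   /* coarse phase of the boot */
    unsigned int order;       /* position inside that phase */
    const char *name;         /* only here so we can see the result */
};

/* Compare by subsystem first, then by order within the subsystem. */
static int
toy_sysinit_cmp(const void *a, const void *b)
{
    const struct toy_sysinit *x = a;
    const struct toy_sysinit *y = b;

    if (x->subsystem != y->subsystem)
        return (x->subsystem < y->subsystem ? -1 : 1);
    if (x->order != y->order)
        return (x->order < y->order ? -1 : 1);
    return (0);
}

/* Sort an array of records into the order they should run in. */
void
toy_sysinit_sort(struct toy_sysinit *items, size_t n)
{
    assert(items != NULL || n == 0);
    if (n > 1)
        qsort(items, n, sizeof(items[0]), toy_sysinit_cmp);
}
```

Feed it three records with (subsystem, order) of (0x2000, 1), (0x1000, 2) and (0x1000, 1) and they come out as (0x1000, 1), (0x1000, 2), (0x2000, 1): the subsystem decides the phase, the order only breaks ties inside it.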

    last = SI_SUB_DUMMY;
#if defined(VERBOSE_SYSINIT)
    TUNABLE_INT_FETCH("debug.verbose_sysinit", &verbose_sysinit);
    verbose = 0;
#if !defined(DDB)
    printf("VERBOSE_SYSINIT: DDB not enabled, symbol lookups disabled.\n");
#endif
#endif

Just some printing, skipping.

/*
 * Perform each system initialization task from the ordered list.  Note
 * that if sysinit_list is modified (e.g. by a KLD) we will nonetheless
 * always perform the earlist-sorted sysinit at each step; using the
 * STAILQ_FOREACH macro would result in items being skipped if inserted
 * earlier than the "current item".
 */
while ((sip = STAILQ_FIRST(&sysinit_list)) != NULL) {
    STAILQ_REMOVE_HEAD(&sysinit_list, next);
    STAILQ_INSERT_TAIL(&sysinit_done_list, sip, next);

    if (sip->subsystem == SI_SUB_DUMMY)
        continue;   /* skip dummy task(s)*/

Now comes the easy part. We iterate over all the systems to initialize. Once we grab an item from sysinit_list, we move it to sysinit_done_list right away, before its function has even been called, so it lands there no matter how its initialization goes.

If we hit a DUMMY system, we just happily skip it and continue iterating.
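The comment in the code explains why this is a "take the head until the list is empty" loop and not a plain foreach. Here is a small userland sketch of that idea, under my own assumptions: a hand-rolled sorted list instead of STAILQ, and made-up names (toy_task, toy_insert, toy_run_all). One of the tasks registers another task while the loop is running, roughly like a kld bringing in its own sysinits.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LOG 8

struct toy_task {
    int prio;                         /* lower runs earlier */
    void (*func)(struct toy_task *);
    struct toy_task *next;
};

static struct toy_task *toy_pending;  /* sorted, earliest first */
static int toy_log[TOY_MAX_LOG];      /* priorities in execution order */
static int toy_nlog;

/* Insert a task while keeping the pending list sorted by prio. */
void
toy_insert(struct toy_task *t)
{
    struct toy_task **pp = &toy_pending;

    while (*pp != NULL && (*pp)->prio <= t->prio)
        pp = &(*pp)->next;
    t->next = *pp;
    *pp = t;
}

/* A boring task: just remember that we ran. */
static void
toy_record(struct toy_task *t)
{
    assert(toy_nlog < TOY_MAX_LOG);
    toy_log[toy_nlog++] = t->prio;
}

static struct toy_task toy_late = { 15, toy_record, NULL };

/* A task that adds another task while the loop is already running. */
static void
toy_record_and_load(struct toy_task *t)
{
    toy_record(t);
    toy_insert(&toy_late);
}

/* Always take the current head, so late insertions are not skipped. */
void
toy_run_all(void)
{
    struct toy_task *t;

    while ((t = toy_pending) != NULL) {
        toy_pending = t->next;        /* unlink before calling */
        t->next = NULL;
        t->func(t);
    }
}
```

With a task at priority 10 using toy_record_and_load and a task at priority 20 using toy_record, the run order is 10, 15, 20. The late arrival is picked up because every iteration asks the list for its head again instead of following a next pointer that was read earlier.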

Next up is a bunch of logging and diagnostics.

        if (sip->subsystem > last)
            BOOTTRACE_INIT("sysinit 0x%7x", sip->subsystem);

#if defined(VERBOSE_SYSINIT)
        if (sip->subsystem != last && verbose_sysinit != 0) {
            verbose = 1;
            printf("subsystem %x\n", sip->subsystem);
        }
        if (verbose) {
#if defined(DDB)
            const char *func, *data;

            func = symbol_name((vm_offset_t)sip->func,
                DB_STGY_PROC);
            data = symbol_name((vm_offset_t)sip->udata,
                DB_STGY_ANY);
            if (func != NULL && data != NULL)
                printf("   %s(&%s)... ", func, data);
            else if (func != NULL)
                printf("   %s(%p)... ", func, sip->udata);
            else
#endif
                printf("   %p(%p)... ", sip->func,
                    sip->udata);
        }
#endif

I don't think this needs much explanation. And guess what, we are almost done.

        /* Call function */
        (*(sip->func))(sip->udata);

#if defined(VERBOSE_SYSINIT)
        if (verbose)
            printf("done.\n");
#endif

        /* Check off the one we're just done */
        last = sip->subsystem;
    }

We call the system's initializer with its udata argument and remember its subsystem in last, so we always know which subsystem we processed last.

    TSEXIT();   /* Here so we don't overlap with start_init. */
    BOOTTRACE("mi_startup done");

    mtx_assert(&Giant, MA_OWNED | MA_NOTRECURSED);
    mtx_unlock(&Giant);

    /*
     * We can't free our thread structure since it is statically allocated.
     * Just sleep forever.  This thread could be repurposed for something if
     * the need arises.
     */
    for (;;)
        tsleep(__builtin_frame_address(0), PNOLOCK, "-", 0);
}

And we are pretty much done with our kernel start. Same as in OpenBSD, we just sleep forever while the system runs.

That can't be all, can it?

No, of course this isn't all. We saw in the OpenBSD version that there is a lot to do. Creating the initial process, creating the kqueue, etc.

Let's take a look at a few of these systems, starting with a super simple one and then looking at the big main one.

Let's start easy

Ok, let's take a super simple system. This is how it looks:

static void
print_caddr_t(const void *data)
{
    printf("%s", (const char *)data);
}

Yep, that is a whole system to be initialized. It's literally just a function that takes a void pointer to some data being passed in.

And here is how it gets added to the list:

C_SYSINIT(trademark, SI_SUB_COPYRIGHT, SI_ORDER_SECOND, print_caddr_t,
    trademark);

C_SYSINIT because we take constant data. But yeah, this literally just gets a name (trademark) and a system ID (SI_SUB_COPYRIGHT) and an ordering (SI_ORDER_SECOND), meaning it is a sub-system of the COPYRIGHT system and is in second place. All it will do is call the print_caddr_t function and pass the trademark data to it. Simple as that.
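If you wonder how a macro at file scope can end up in a list that mi_startup walks later: as far as I understand it, FreeBSD collects these records in a linker set (that is the sysinit_set from above). The sketch below is not that mechanism. It swaps in the GCC/Clang constructor attribute to get a similar "declare it anywhere, find it later" effect in a userland program. TOY_C_SYSINIT, toy_register and all other toy_ names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record, loosely modelled on what the macro describes. */
struct toy_sysinit {
    unsigned int subsystem;
    unsigned int order;
    void (*func)(const void *);
    const void *udata;
    struct toy_sysinit *next;
};

static struct toy_sysinit *toy_registry;

/* Push a record onto the global registry. */
void
toy_register(struct toy_sysinit *s)
{
    assert(s != NULL && s->func != NULL);
    s->next = toy_registry;
    toy_registry = s;
}

/*
 * Declare a record and register it before main() runs.
 * Assumption: the compiler supports __attribute__((constructor)).
 */
#define TOY_C_SYSINIT(uniq, sub, ord, fn, data)                 \
    static struct toy_sysinit uniq##_toy_sys = {                \
        (sub), (ord), (fn), (data), NULL                        \
    };                                                          \
    __attribute__((constructor)) static void                    \
    uniq##_toy_ctor(void)                                       \
    {                                                           \
        toy_register(&uniq##_toy_sys);                          \
    }

static char toy_out[64];  /* captures output instead of printing */

/* Same shape as print_caddr_t, but writes into toy_out. */
static void
toy_print(const void *data)
{
    size_t used = strlen(toy_out);

    snprintf(toy_out + used, sizeof(toy_out) - used, "%s",
        (const char *)data);
}

static const char toy_banner[] = "hello from a toy subsystem\n";
TOY_C_SYSINIT(banner, 0x1000, 2, toy_print, toy_banner)

/* Call every registered function; returns how many were called. */
size_t
toy_run_registered(void)
{
    size_t n = 0;

    for (struct toy_sysinit *s = toy_registry; s != NULL; s = s->next) {
        s->func(s->udata);
        n++;
    }
    return (n);
}
```

Calling toy_run_registered() from main finds the banner record even though nothing in main ever mentions it. That is the part that matters: the code that declares a subsystem and the code that runs it never have to know about each other.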

Congratulations, you now know how to write a subsystem for the FreeBSD kernel.

So let's now jump to something a tad more interesting. I mean, the OpenBSD kernel did so much, where is all that stuff here then?

The big chomper

There it is again… process0, the ancestor to all processes. Here it is treated as a "system". It is a tad long, so I will cut some parts out and mark them with // ...

/*
 * The two following SYSINIT's are proc0 specific glue code.  I am not
 * convinced that they can not be safely combined, but their order of
 * operation has been maintained as the same as the original init_main.c
 * for right now.
 */
/* ARGSUSED*/
static void
proc0_init(void *dummy __unused)
{
    struct proc *p;
    struct thread *td;
    struct ucred *newcred;
    struct uidinfo tmpuinfo;
    struct loginclass tmplc = {
        .lc_name = "",
    };
    vm_paddr_t pageablemem;
    int i;

Skipping past declarations, but just adding them here for completeness.

GIANT_REQUIRED;
p = &proc0;
td = &thread0;

GIANT_REQUIRED makes sure that we hold the Giant lock. This is pretty much the same as the KERNEL_LOCK over on OpenBSD.

And also very similar, we grab our process and kernel thread. Same thing as in OpenBSD again, just slightly different naming (proc / thread vs. proc / process).

/*
 * Initialize magic number and osrel.
 */
p->p_magic = P_MAGIC;
p->p_osrel = osreldate;

/*
 * Initialize thread and process structures.
 */
procinit(); /* set up proc zone */
threadinit();   /* set up UMA zones */

/*
 * Initialise scheduler resources.
 * Add scheduler specific parts to proc, thread as needed.
 */
schedinit();    /* scheduler gets its house in order */

This is similar again. We set up the proc data and initialize memory pools. UMA is similar to OpenBSD's pool, predefined memory areas to just grab fixed size items from. Faster, easier, simpler.
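For anyone who hasn't seen this kind of allocator before, here is a very rough sketch of the "grab fixed size items from a predefined area" idea. This is not how UMA is implemented (the real thing is far more involved); it is only a tiny free-list pool with made-up names (toy_zone, toy_zone_alloc, toy_zone_free) and made-up sizes.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ITEM_SIZE  64
#define TOY_ITEM_COUNT 4

/* A slot is either handed out as payload or linked into the free list. */
union toy_item {
    union toy_item *next_free;
    unsigned char payload[TOY_ITEM_SIZE];
};

struct toy_zone {
    union toy_item items[TOY_ITEM_COUNT];
    union toy_item *free_list;
};

/* Put every slot on the free list. */
void
toy_zone_init(struct toy_zone *z)
{
    z->free_list = NULL;
    for (size_t i = 0; i < TOY_ITEM_COUNT; i++) {
        z->items[i].next_free = z->free_list;
        z->free_list = &z->items[i];
    }
}

/* Pop one slot; returns NULL when the zone is exhausted. */
void *
toy_zone_alloc(struct toy_zone *z)
{
    union toy_item *it = z->free_list;

    if (it == NULL)
        return (NULL);
    z->free_list = it->next_free;
    return (it);
}

/* Push a slot back; it must come from this zone. */
void
toy_zone_free(struct toy_zone *z, void *p)
{
    union toy_item *it = p;

    assert(it >= &z->items[0] && it < &z->items[TOY_ITEM_COUNT]);
    it->next_free = z->free_list;
    z->free_list = it;
}
```

Allocation and free are both a couple of pointer assignments, with no searching and no fragmentation inside the zone. That is the "faster, easier, simpler" part.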

Then we initialize the scheduler. In FreeBSD, there are actually two different ones. ULE (the newer and default one) and 4BSD (the older one). Which one gets initialized depends on which one has been compiled in.

/*
 * Create process 0.
 */
LIST_INSERT_HEAD(&allproc, p, p_list);
LIST_INSERT_HEAD(PIDHASH(0), p, p_hash);
mtx_init(&pgrp0.pg_mtx, "process group", NULL, MTX_DEF | MTX_DUPOK);
sx_init(&pgrp0.pg_killsx, "killpg racer");
p->p_pgrp = &pgrp0;
LIST_INSERT_HEAD(PGRPHASH(0), &pgrp0, pg_hash);
LIST_INIT(&pgrp0.pg_members);
LIST_INSERT_HEAD(&pgrp0.pg_members, p, p_pglist);

pgrp0.pg_session = &session0;
mtx_init(&session0.s_mtx, "session", NULL, MTX_DEF);
refcount_init(&session0.s_count, 1);
session0.s_leader = p;

Again, similar. We set up the data for process0. Only slight difference here is that FreeBSD already uses a lock for the process group instead of the big giant lock (the mtx_init call).

// ...
td->td_cpuset = cpuset_thread0();
// ...
strncpy(p->p_comm, "kernel", sizeof (p->p_comm));
strncpy(td->td_name, "kernel", sizeof (td->td_name));
// ...

Just a tiny difference here in the name. While OpenBSD stuck with the historical "swapper" name, here it's more aptly called "kernel".

/* Create credentials. */
newcred = crget();
newcred->cr_ngroups = 1;    /* group 0 */
// ...
/* Create sigacts. */
p->p_sigacts = sigacts_alloc();

/* Initialize signal state for process 0. */
siginit(&proc0);

/* Create the file descriptor table. */
p->p_pd = pdinit(NULL, false);
p->p_fd = fdinit();
p->p_fdtol = NULL;

/* Create the limits structures. */
p->p_limit = lim_alloc();
// ...

/* Initialize resource accounting structures. */
racct_create(&p->p_racct);

p->p_stats = pstats_alloc();

Pretty much the same as on OpenBSD. Set up credentials, signals, limits, file descriptors…

/* Allocate a prototype map so we have something to fork. */
p->p_vmspace = &vmspace0;
refcount_init(&vmspace0.vm_refcnt, 1);
pmap_pinit0(vmspace_pmap(&vmspace0));

/*
 * proc0 is not expected to enter usermode, so there is no special
 * handling for sv_minuser here, like is done for exec_new_vmspace().
 */
vm_map_init(&vmspace0.vm_map, vmspace_pmap(&vmspace0),
    p->p_sysent->sv_minuser, p->p_sysent->sv_maxuser);

Our process0 also needs virtual memory space, no surprise here. The code looks slightly different, but it is semantically the same as what OpenBSD does.

    /*
     * Call the init and ctor for the new thread and proc.  We wait
     * to do this until all other structures are fairly sane.
     */
    EVENTHANDLER_DIRECT_INVOKE(process_init, p);
    EVENTHANDLER_DIRECT_INVOKE(thread_init, td);
#ifdef KDTRACE_HOOKS
    kdtrace_proc_ctor(p);
    kdtrace_thread_ctor(td);
#endif
    EVENTHANDLER_DIRECT_INVOKE(process_ctor, p);
    EVENTHANDLER_DIRECT_INVOKE(thread_ctor, td);

    /*
     * Charge root for one process.
     */
    (void)chgproccnt(p->p_ucred->cr_ruidinfo, 1, 0);
    PROC_LOCK(p);
    racct_add_force(p, RACCT_NPROC, 1);
    PROC_UNLOCK(p);
}
SYSINIT(p0init, SI_SUB_INTRINSIC, SI_ORDER_FIRST, proc0_init, NULL);

The interesting part here is EVENTHANDLER_DIRECT_INVOKE. process_init here is a list. Other systems can register themselves in that list. "Please call me when you are done with this". A simple callback system basically.

So, any system that wanted to know when the process initialization has been completed now gets a callback. And with that, processes are initialized.
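Stripped of all kernel details, such a callback list could look roughly like the sketch below. It only illustrates the pattern; the real eventhandler code has locking, priorities and more. Every name in it (toy_eventlist, toy_event_register, toy_event_invoke, toy_count_calls) is made up.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_HANDLERS 8

typedef void (*toy_handler_fn)(void *arg, void *subject);

struct toy_handler {
    toy_handler_fn fn;
    void *arg;            /* private data of whoever registered */
};

struct toy_eventlist {
    struct toy_handler h[TOY_MAX_HANDLERS];
    size_t n;
};

/* "Please call me when this event happens." Returns 0, or -1 if full. */
int
toy_event_register(struct toy_eventlist *l, toy_handler_fn fn, void *arg)
{
    assert(l != NULL && fn != NULL);
    if (l->n == TOY_MAX_HANDLERS)
        return (-1);
    l->h[l->n].fn = fn;
    l->h[l->n].arg = arg;
    l->n++;
    return (0);
}

/* Call every registered handler, in registration order. */
void
toy_event_invoke(struct toy_eventlist *l, void *subject)
{
    for (size_t i = 0; i < l->n; i++)
        l->h[i].fn(l->h[i].arg, subject);
}

/* Example handler: count how often we were called. */
void
toy_count_calls(void *arg, void *subject)
{
    (void)subject;
    (*(int *)arg)++;
}
```

Two subsystems can each register toy_count_calls with their own counter, and a single toy_event_invoke bumps both. The code that fires the event has no idea who is listening.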

The init system

We wouldn't be complete without an init system. So we'll look at it here too, but I went and cut out a lot of the noise, mostly to save on space and time. If you are interested, feel free to go and check the code yourself in sys/kern/init_main.c.

/*
 * Start the initial user process; try exec'ing each pathname in init_path.
 * The program is invoked with one argument containing the boot flags.
 */
static void
start_init(void *dummy)
{
    vfs_mountroot();

    /* Wipe GELI passphrase from the environment. */
    kern_unsetenv("kern.geom.eli.passphrase");

    while ((path = strsep(&tmp_init_path, ":")) != NULL) {
        error = exec_alloc_args(&args);
        error = exec_args_add_fname(&args, path, UIO_SYSSPACE);
        error = exec_args_add_arg(&args, path, UIO_SYSSPACE);
        /*
         * Now try to exec the program.  If can't for any reason
         * other than it doesn't exist, complain.
         *
         * Otherwise, return via fork_trampoline() all the way
         * to user mode as init!
         */
        error = kern_execve(td, &args, NULL, oldvmspace);
    }
    panic("no init");
}

Cut down, it looks almost the same as the OpenBSD version. We mount root, look for the init program and get it running.
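The loop over init_path is easy to play with in userland. The sketch below splits a colon-separated list in place (by hand with strchr, to stay in plain standard C instead of relying on strsep) and tries each candidate until one works. The toy_ names, the callback standing in for the exec call and the example paths are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for "try to exec this path"; returns 0 on success. */
typedef int (*toy_exec_fn)(const char *path);

/*
 * Try every colon-separated candidate in 'paths' (modified in place).
 * Returns 0 and sets *chosen on the first success, -1 if none worked.
 */
int
toy_try_init_paths(char *paths, toy_exec_fn try_exec, const char **chosen)
{
    char *cur = paths;

    assert(try_exec != NULL && chosen != NULL);
    while (cur != NULL) {
        char *sep = strchr(cur, ':');
        char *next = NULL;

        if (sep != NULL) {
            *sep = '\0';          /* terminate this candidate */
            next = sep + 1;
        }
        if (try_exec(cur) == 0) {
            *chosen = cur;
            return (0);
        }
        cur = next;
    }
    return (-1);                  /* the kernel would panic("no init") */
}

/* Hypothetical exec that only "finds" one specific path. */
int
toy_exec_only_sbin_init(const char *path)
{
    return (strcmp(path, "/sbin/init") == 0 ? 0 : -1);
}
```

Given "/nonexistent/init:/sbin/init:/rescue/init", the first candidate fails, the second succeeds, and the third is never tried. If every candidate fails, we fall out of the loop, which is where the kernel gives up with its panic.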

/*
 * Like kproc_create(), but runs in its own address space.  We do this
 * early to reserve pid 1.  Note special case - do not make it
 * runnable yet, init execution is started when userspace can be served.
 */
static void
create_init(const void *udata __unused)
{
    fr.fr_procp = &initproc;
    error = fork1(&thread0, &fr);
    cpu_fork_kthread_handler(FIRST_THREAD_IN_PROC(initproc),
        start_init, NULL);
}
SYSINIT(init, SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL);

Same again. We prepare the fork and point the new process at start_init from above. But, as the comment says, it is not made runnable yet; that happens once the kthread system is ready.

/*
 * Make it runnable now.
 */
static void
kick_init(const void *udata __unused)
{
    td = FIRST_THREAD_IN_PROC(initproc);
    TD_SET_CAN_RUN(td);
    sched_add(td, SRQ_BORING);
}
SYSINIT(kickinit, SI_SUB_KTHREAD_INIT, SI_ORDER_MIDDLE, kick_init, NULL);

And here it is. Somewhere in the middle of the kthread subsystem, we kick init into gear and let it run. And yes, that flag there really is called SRQ_BORING.

Conclusion

At their core, both OpenBSD and FreeBSD follow the same outline here. We create process0, wire it up with credentials and virtual memory, initialize our internal accounting for processes, etc.

Not really surprising as they both come from the same lineage. Same bones, different meat-suit.

What I like in the OpenBSD one: super easy to follow and no surprises. Just read the main function and follow it step by step like a damn cooking recipe. Readable, auditable, goal achieved I'd say.

The FreeBSD version is a bit more all over the place. But, it is more modular. A kld can use the same SYSINIT framework to run its own initialization at load time, hooking into the same ordering system. But then there's also the callbacks, etc. The goal of extensibility and flexibility is achieved 100% here, with the trade-off of readability.

Neither is better or worse. It comes down to what you need and having that choice is good. I love and run both systems with different use-cases.

Next up, … I don't know yet. Maybe some more kernel stuff, maybe back to some binaries? Or maybe we take a look at PF's code.

Let me know over on Mastodon if there is a topic you'd be interested in. Or feedback. Or just yell at me for getting stuff wrong, the chances are very high I got a lot wrong :D