Daily Source Reading: Kernel Main [FreeBSD]
Looking at the other side
Yesterday we looked at OpenBSD's main function. Today, we will look
at how FreeBSD does it.
A different kind of startup
It took a bit longer to dissect and jump around, but yeah, FreeBSD
has a very different approach (at least in style) to starting the
kernel. We will see very early on why.
/*
 * System startup; initialize the world, create process 0, mount root
 * filesystem, and fork to create init and pagedaemon. Most of the
 * hard work is done in the lower-level initialization routines including
 * startup(), which does memory initialization and autoconfiguration.
 *
 * This allows simple addition of new kernel subsystems that require
 * boot time initialization. It also allows substitution of subsystem
 * (for instance, a scheduler, kernel profiler, or VM system) by object
 * module. Finally, it allows for optional "kernel threads".
 */
void
mi_startup(void)
{
    struct sysinit *sip;
    int last;
#if defined(VERBOSE_SYSINIT)
    int verbose;
#endif

    TSENTER();

    if (boothowto & RB_VERBOSE)
        bootverbose++;
Run of the mill start of a function, just included for completeness.
TSENTER here is merely for time logging if it is enabled.
    /* Construct and sort sysinit list. */
    sysinit_mklist(&sysinit_list, SET_BEGIN(sysinit_set), SET_LIMIT(sysinit_set));
Wait what? Remember all these different steps that the OpenBSD
kernel took? Initializing virtual memory, setting up the scheduler,
etc? They were kind of hardcoded. Easy to read and find, but less
modular.
FreeBSD here takes a slightly different approach and treats these as
subsystems. So we create a list of them. The sysinit_set is a
collection of systems that are compiled in. There can be more, for
example due to a kld, so keep that in mind.
The kernel has two macros sprinkled around, SYSINIT and C_SYSINIT
(one for functions that take non-const arguments and one for functions
that take const arguments). These add items to that set.
Each system has a priority and ordering (and sub-systems within it have an ordering). We will look at a few examples later, so that it'll make sense.
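To make the ordering idea a bit more concrete, here is a rough userspace sketch of sorting entries by subsystem first and by order second. Everything in it is made up for illustration: the struct `toy_sysinit`, the numeric values and the functions `toy_cmp` and `toy_sort` are not the kernel's. The real definitions live in sys/kernel.h and look different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a sysinit entry; not the kernel's struct. */
struct toy_sysinit {
    unsigned int subsystem;  /* coarse phase, lower runs earlier */
    unsigned int order;      /* position within the phase */
    const char *name;
};

/* Compare by subsystem first, then by order within the subsystem. */
static int
toy_cmp(const void *a, const void *b)
{
    const struct toy_sysinit *x = a;
    const struct toy_sysinit *y = b;

    if (x->subsystem != y->subsystem)
        return (x->subsystem < y->subsystem ? -1 : 1);
    if (x->order != y->order)
        return (x->order < y->order ? -1 : 1);
    return (0);
}

/* Sort an array of entries into the order they would be run in. */
static void
toy_sort(struct toy_sysinit *v, size_t n)
{
    qsort(v, n, sizeof(v[0]), toy_cmp);
}
```

Feed it entries in any order and an entry with a lower subsystem value always ends up in front, no matter what its order value says. The order value only breaks ties inside one subsystem.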
    last = SI_SUB_DUMMY;
#if defined(VERBOSE_SYSINIT)
    TUNABLE_INT_FETCH("debug.verbose_sysinit", &verbose_sysinit);
    verbose = 0;
#if !defined(DDB)
    printf("VERBOSE_SYSINIT: DDB not enabled, symbol lookups disabled.\n");
#endif
#endif
Just some printing, skipping.
    /*
     * Perform each system initialization task from the ordered list. Note
     * that if sysinit_list is modified (e.g. by a KLD) we will nonetheless
     * always perform the earlist-sorted sysinit at each step; using the
     * STAILQ_FOREACH macro would result in items being skipped if inserted
     * earlier than the "current item".
     */
    while ((sip = STAILQ_FIRST(&sysinit_list)) != NULL) {
        STAILQ_REMOVE_HEAD(&sysinit_list, next);
        STAILQ_INSERT_TAIL(&sysinit_done_list, sip, next);

        if (sip->subsystem == SI_SUB_DUMMY)
            continue;    /* skip dummy task(s)*/
Now comes the easy part. We iterate over all the systems to
initialize. Once we grab an item from sysinit_list, we move it to
sysinit_done_list before it even runs. The initializers return void,
so there is no success or failure to track anyway.
If we hit a DUMMY system, we just happily skip it and continue
iterating.
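The comment in that loop explains why it always takes the head of the list instead of walking it with STAILQ_FOREACH. Here is a small userspace sketch of that idea. It uses a hand-rolled singly linked list rather than the STAILQ macros, and all the names (`toy_task`, `toy_insert`, `toy_run_all`, `toy_register_late`) are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy task; the kernel's struct sysinit looks different. */
struct toy_task {
    int key;                /* sort key, lower runs first */
    void (*func)(void);     /* may be NULL */
    struct toy_task *next;
};

static struct toy_task *toy_pending;  /* sorted list of tasks to run */
static int toy_ran[8];                /* keys in the order they ran */
static int toy_nran;

/* Insert a task so that the pending list stays sorted by key. */
static void
toy_insert(struct toy_task *t)
{
    struct toy_task **pp = &toy_pending;

    while (*pp != NULL && (*pp)->key <= t->key)
        pp = &(*pp)->next;
    t->next = *pp;
    *pp = t;
}

/* Always take the current head, so late insertions are never skipped. */
static void
toy_run_all(void)
{
    struct toy_task *t;

    while ((t = toy_pending) != NULL) {
        toy_pending = t->next;
        if (t->func != NULL)
            t->func();
        if (toy_nran < 8)
            toy_ran[toy_nran++] = t->key;
    }
}

static struct toy_task toy_late = { 15, NULL, NULL };

/* Simulates a module registering another task while we are running. */
static void
toy_register_late(void)
{
    toy_insert(&toy_late);
}
```

If a task with key 10 calls toy_register_late while a task with key 20 is still pending, the new task with key 15 becomes the head and runs before 20. A loop that had already cached a pointer to the next item would have walked straight past it.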
Next up is a bunch of logging and diagnostics.
        if (sip->subsystem > last)
            BOOTTRACE_INIT("sysinit 0x%7x", sip->subsystem);

#if defined(VERBOSE_SYSINIT)
        if (sip->subsystem != last && verbose_sysinit != 0) {
            verbose = 1;
            printf("subsystem %x\n", sip->subsystem);
        }
        if (verbose) {
#if defined(DDB)
            const char *func, *data;

            func = symbol_name((vm_offset_t)sip->func, DB_STGY_PROC);
            data = symbol_name((vm_offset_t)sip->udata, DB_STGY_ANY);
            if (func != NULL && data != NULL)
                printf(" %s(&%s)... ", func, data);
            else if (func != NULL)
                printf(" %s(%p)... ", func, sip->udata);
            else
#endif
                printf(" %p(%p)... ", sip->func, sip->udata);
        }
#endif
I don't think this needs much explanation. And guess what, we are almost done.
        /* Call function */
        (*(sip->func))(sip->udata);

#if defined(VERBOSE_SYSINIT)
        if (verbose)
            printf("done.\n");
#endif

        /* Check off the one we're just done */
        last = sip->subsystem;
    }
We call the system's initializer with its argument, sip->udata,
and store its subsystem in last, so
we always know which one we processed last.
    TSEXIT();    /* Here so we don't overlap with start_init. */
    BOOTTRACE("mi_startup done");

    mtx_assert(&Giant, MA_OWNED | MA_NOTRECURSED);
    mtx_unlock(&Giant);

    /*
     * We can't free our thread structure since it is statically allocated.
     * Just sleep forever. This thread could be repurposed for something if
     * the need arises.
     */
    for (;;)
        tsleep(__builtin_frame_address(0), PNOLOCK, "-", 0);
}
And we are pretty much done with our kernel start. Same as in
OpenBSD, we just sleep forever while the system runs.
That can't be all, can it?
No, of course this isn't all. We saw in the OpenBSD version that
there is a lot to do. Creating the initial process, creating the
kqueue, etc.
Let's take a look at a few of these systems, starting with a super simple one and then looking at the big main one.
Let's start easy
Ok, let's take a super simple system. This is how it looks:
static void
print_caddr_t(const void *data)
{
    printf("%s", (const char *)data);
}
Yep, that is a whole system to be initialized. It's literally just a
function that takes a void pointer to some data being passed in.
And here is how it gets added to the list:
C_SYSINIT(trademark, SI_SUB_COPYRIGHT, SI_ORDER_SECOND, print_caddr_t, trademark);
C_SYSINIT because we take constant data. But yeah, this literally
just gets a name (trademark) and a system ID (SI_SUB_COPYRIGHT) and
an ordering (SI_ORDER_SECOND), meaning it is a sub-system of the
COPYRIGHT system and is in second place. All it will do is call the
print_caddr_t function and pass the trademark data to it. Simple as that.
Congratulations, you now know how to write a subsystem for the
FreeBSD kernel.
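If you want to play with the registration idea outside of a kernel, here is a userspace sketch. Big caveat: the kernel collects these entries in a linker set, while this toy swaps in a GCC/Clang constructor function that runs before main(). The macro `TOY_C_SYSINIT`, the struct `toy_entry`, the numeric subsystem and order values and the notice string are all invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void (*toy_init_fn)(const void *);

/* Hypothetical registry entry; the kernel's struct sysinit differs. */
struct toy_entry {
    const char *name;
    unsigned int subsystem;
    unsigned int order;
    toy_init_fn func;
    const void *udata;
};

#define TOY_MAX 16
static const struct toy_entry *toy_table[TOY_MAX];
static size_t toy_count;

static void
toy_register(const struct toy_entry *e)
{
    if (toy_count < TOY_MAX)
        toy_table[toy_count++] = e;
}

/*
 * Made-up analogue of C_SYSINIT. The kernel uses a linker set; this
 * sketch uses a GCC/Clang constructor that runs before main().
 */
#define TOY_C_SYSINIT(uniq, sub, ord, fn, data) \
    static const struct toy_entry uniq##_entry = { \
        #uniq, (sub), (ord), (fn), (data) \
    }; \
    __attribute__((constructor)) static void \
    uniq##_register(void) \
    { \
        toy_register(&uniq##_entry); \
    } \
    struct toy_entry /* swallows the trailing semicolon */

static char toy_out[64];  /* captures what the initializers "print" */

static void
toy_print(const void *data)
{
    strncat(toy_out, (const char *)data,
        sizeof(toy_out) - strlen(toy_out) - 1);
}

static const char toy_trademark[] = "toy trademark notice\n";

TOY_C_SYSINIT(trademark, 1, 2, toy_print, toy_trademark);

/* Call every registered initializer with its data pointer. */
static void
toy_run_registered(void)
{
    size_t i;

    for (i = 0; i < toy_count; i++)
        toy_table[i]->func(toy_table[i]->udata);
}
```

The point is the shape: the macro line sits next to the function it registers, and nothing in the central loop has to know that the function exists.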
So let's now jump to something a tad more interesting. I mean, the
OpenBSD kernel did so much, where is all that stuff here then?
The big chomper
There it is again… process0, the ancestor to all processes. Here it is treated as a "system". It is a tad long, so I will cut some parts out and mark them with // ...
/*
 * The two following SYSINIT's are proc0 specific glue code. I am not
 * convinced that they can not be safely combined, but their order of
 * operation has been maintained as the same as the original init_main.c
 * for right now.
 */
/* ARGSUSED*/
static void
proc0_init(void *dummy __unused)
{
    struct proc *p;
    struct thread *td;
    struct ucred *newcred;
    struct uidinfo tmpuinfo;
    struct loginclass tmplc = {
        .lc_name = "",
    };
    vm_paddr_t pageablemem;
    int i;
Skipping past declarations, but just adding them here for completeness.
    GIANT_REQUIRED;
    p = &proc0;
    td = &thread0;
GIANT_REQUIRED asserts that we hold the Giant lock. This is
pretty much the same as the KERNEL_LOCK over on OpenBSD.
And also very similar, we grab our process and kernel thread. Same
thing as in OpenBSD again, just with slightly different naming (proc /
thread vs. proc / process).
    /*
     * Initialize magic number and osrel.
     */
    p->p_magic = P_MAGIC;
    p->p_osrel = osreldate;

    /*
     * Initialize thread and process structures.
     */
    procinit();      /* set up proc zone */
    threadinit();    /* set up UMA zones */

    /*
     * Initialise scheduler resources.
     * Add scheduler specific parts to proc, thread as needed.
     */
    schedinit();     /* scheduler gets its house in order */
This is similar again. We set up the proc data and initialize
memory pools. UMA is similar to OpenBSD's pool, predefined
memory areas to just grab fixed size items from. Faster, easier,
simpler.
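For anyone who has not met this kind of allocator before, here is a tiny userspace sketch of the "grab fixed size items" idea. It is nowhere near what UMA or OpenBSD's pool really do (no per-CPU caches, no slabs, no locking), and the names `toy_zone_init`, `toy_zone_alloc` and `toy_zone_free` as well as the sizes are made up.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_ITEM_SIZE  64
#define TOY_ITEM_COUNT 4

/* A free item stores the link to the next free item in its own memory. */
union toy_item {
    union toy_item *next_free;
    unsigned char payload[TOY_ITEM_SIZE];
};

static union toy_item toy_storage[TOY_ITEM_COUNT];
static union toy_item *toy_free_list;

/* Put every item of the static backing array on the free list. */
static void
toy_zone_init(void)
{
    size_t i;

    toy_free_list = NULL;
    for (i = 0; i < TOY_ITEM_COUNT; i++) {
        toy_storage[i].next_free = toy_free_list;
        toy_free_list = &toy_storage[i];
    }
}

/* Pop one item off the free list; returns NULL when the zone is empty. */
static void *
toy_zone_alloc(void)
{
    union toy_item *it = toy_free_list;

    if (it != NULL)
        toy_free_list = it->next_free;
    return (it);
}

/* Push an item back; it becomes the next one handed out. */
static void
toy_zone_free(void *p)
{
    union toy_item *it = p;

    it->next_free = toy_free_list;
    toy_free_list = it;
}
```

Both allocation and free are a couple of pointer moves with no searching, which is where the "faster, easier, simpler" comes from.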
Then we initialize the scheduler. In FreeBSD, there are actually
two different ones. ULE (the newer and default one) and 4BSD (the
older one). Which one gets initialized depends on which one has been
compiled in.
    /*
     * Create process 0.
     */
    LIST_INSERT_HEAD(&allproc, p, p_list);
    LIST_INSERT_HEAD(PIDHASH(0), p, p_hash);
    mtx_init(&pgrp0.pg_mtx, "process group", NULL, MTX_DEF | MTX_DUPOK);
    sx_init(&pgrp0.pg_killsx, "killpg racer");
    p->p_pgrp = &pgrp0;
    LIST_INSERT_HEAD(PGRPHASH(0), &pgrp0, pg_hash);
    LIST_INIT(&pgrp0.pg_members);
    LIST_INSERT_HEAD(&pgrp0.pg_members, p, p_pglist);

    pgrp0.pg_session = &session0;
    mtx_init(&session0.s_mtx, "session", NULL, MTX_DEF);
    refcount_init(&session0.s_count, 1);
    session0.s_leader = p;
Again, similar. We set up the data for process0. Only slight
difference here is that FreeBSD already uses a lock for the process
group instead of the big giant lock (the mtx_init call).
    // ...
    td->td_cpuset = cpuset_thread0();
    // ...
    strncpy(p->p_comm, "kernel", sizeof (p->p_comm));
    strncpy(td->td_name, "kernel", sizeof (td->td_name));
    // ...
Just a tiny difference here in the name. While OpenBSD stuck with
the historical "swapper" name, here it's more aptly called
"kernel".
    /* Create credentials. */
    newcred = crget();
    newcred->cr_ngroups = 1;    /* group 0 */
    // ...

    /* Create sigacts. */
    p->p_sigacts = sigacts_alloc();

    /* Initialize signal state for process 0. */
    siginit(&proc0);

    /* Create the file descriptor table. */
    p->p_pd = pdinit(NULL, false);
    p->p_fd = fdinit();
    p->p_fdtol = NULL;

    /* Create the limits structures. */
    p->p_limit = lim_alloc();
    // ...

    /* Initialize resource accounting structures. */
    racct_create(&p->p_racct);

    p->p_stats = pstats_alloc();
Pretty much the same as on OpenBSD. Set up credentials, signals,
limits, file descriptors…
    /* Allocate a prototype map so we have something to fork. */
    p->p_vmspace = &vmspace0;
    refcount_init(&vmspace0.vm_refcnt, 1);
    pmap_pinit0(vmspace_pmap(&vmspace0));

    /*
     * proc0 is not expected to enter usermode, so there is no special
     * handling for sv_minuser here, like is done for exec_new_vmspace().
     */
    vm_map_init(&vmspace0.vm_map, vmspace_pmap(&vmspace0),
        p->p_sysent->sv_minuser, p->p_sysent->sv_maxuser);
Our process0 also needs virtual memory space, no surprise here. The
code looks slightly different, but is semantically the same as what
OpenBSD does.
    /*
     * Call the init and ctor for the new thread and proc. We wait
     * to do this until all other structures are fairly sane.
     */
    EVENTHANDLER_DIRECT_INVOKE(process_init, p);
    EVENTHANDLER_DIRECT_INVOKE(thread_init, td);
#ifdef KDTRACE_HOOKS
    kdtrace_proc_ctor(p);
    kdtrace_thread_ctor(td);
#endif
    EVENTHANDLER_DIRECT_INVOKE(process_ctor, p);
    EVENTHANDLER_DIRECT_INVOKE(thread_ctor, td);

    /*
     * Charge root for one process.
     */
    (void)chgproccnt(p->p_ucred->cr_ruidinfo, 1, 0);
    PROC_LOCK(p);
    racct_add_force(p, RACCT_NPROC, 1);
    PROC_UNLOCK(p);
}
SYSINIT(p0init, SI_SUB_INTRINSIC, SI_ORDER_FIRST, proc0_init, NULL);
The interesting part here is
EVENTHANDLER_DIRECT_INVOKE. process_init
here is a list. Other systems can register themselves in that list.
"Please call me when you are done with this". A simple callback
system basically.
So, any system that wanted to know when the process initialization has been completed now gets a callback. And with that, processes are initialized.
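Stripped of all the kernel machinery, such a callback list can be sketched in a few lines of userspace C. This is only the general idea, not how the kernel's eventhandler code is implemented (no locking, no priorities, no deregistration), and `toy_event`, `toy_event_register`, `toy_event_invoke` and `toy_count_calls` are names I made up.

```c
#include <assert.h>
#include <stddef.h>

/* A handler gets its own private argument plus the subject of the event. */
typedef void (*toy_handler_fn)(void *arg, void *subject);

struct toy_handler {
    toy_handler_fn fn;
    void *arg;
};

#define TOY_MAX_HANDLERS 8

/* Hypothetical event: a fixed-size list of registered handlers. */
struct toy_event {
    struct toy_handler h[TOY_MAX_HANDLERS];
    size_t n;
};

/* Add a handler to the event; returns -1 when the list is full. */
static int
toy_event_register(struct toy_event *ev, toy_handler_fn fn, void *arg)
{
    if (ev->n == TOY_MAX_HANDLERS)
        return (-1);
    ev->h[ev->n].fn = fn;
    ev->h[ev->n].arg = arg;
    ev->n++;
    return (0);
}

/* Call every registered handler, in registration order. */
static void
toy_event_invoke(struct toy_event *ev, void *subject)
{
    size_t i;

    for (i = 0; i < ev->n; i++)
        ev->h[i].fn(ev->h[i].arg, subject);
}

/* Example handler: counts how often it has been called. */
static void
toy_count_calls(void *arg, void *subject)
{
    (void)subject;
    (*(int *)arg)++;
}
```

Whoever fires the event only needs to know the list, not who is on it. That is the same decoupling SYSINIT gives the startup loop.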
The init system
We wouldn't be complete without an init system. So we'll look at it
here too, but I went and cut out a lot of the noise, mostly to save on
space and time. If you are interested, feel free to go and check the
code yourself in sys/kern/init_main.c.
/*
 * Start the initial user process; try exec'ing each pathname in init_path.
 * The program is invoked with one argument containing the boot flags.
 */
static void
start_init(void *dummy)
{
    vfs_mountroot();

    /* Wipe GELI passphrase from the environment. */
    kern_unsetenv("kern.geom.eli.passphrase");

    while ((path = strsep(&tmp_init_path, ":")) != NULL) {
        error = exec_alloc_args(&args);
        error = exec_args_add_fname(&args, path, UIO_SYSSPACE);
        error = exec_args_add_arg(&args, path, UIO_SYSSPACE);

        /*
         * Now try to exec the program. If can't for any reason
         * other than it doesn't exist, complain.
         *
         * Otherwise, return via fork_trampoline() all the way
         * to user mode as init!
         */
        error = kern_execve(td, &args, NULL, oldvmspace);
    }
    panic("no init");
}
Cut down, it looks almost the same as the OpenBSD version. We mount
root, look for the init program and get it running.
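The "try each path until one works" loop is easy to mimic in userspace. In this sketch I split the colon-separated list by hand with strcspn instead of strsep, just to stay within plain ISO C, and the actual exec is replaced by a callback. The names `toy_start_init`, `toy_exec_fn` and `toy_only_sbin_init` are invented, and where the kernel panics with "no init", the toy simply returns -1.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: returns 0 if the program at path could be started. */
typedef int (*toy_exec_fn)(const char *path);

/*
 * Walk a colon-separated list and return the index of the first candidate
 * that try_exec accepts, or -1 if none worked. The accepted path is
 * copied into out.
 */
static int
toy_start_init(const char *path_list, toy_exec_fn try_exec,
    char *out, size_t outlen)
{
    const char *p = path_list;
    int idx = 0;

    while (*p != '\0') {
        size_t len = strcspn(p, ":");

        /* Skip empty entries and entries that do not fit into out. */
        if (len > 0 && len < outlen) {
            memcpy(out, p, len);
            out[len] = '\0';
            if (try_exec(out) == 0)
                return (idx);
        }
        idx++;
        p += len;
        if (*p == ':')
            p++;
    }
    if (outlen > 0)
        out[0] = '\0';
    return (-1);
}

/* Example stand-in for exec: pretend only "/sbin/init" exists. */
static int
toy_only_sbin_init(const char *path)
{
    return (strcmp(path, "/sbin/init") == 0 ? 0 : -1);
}
```

With a list like "/sbin/oinit:/sbin/init:/rescue/init" (a made-up list for this example) and the stand-in above, the first entry fails and the second one wins, so the later fallbacks are never tried.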
/*
 * Like kproc_create(), but runs in its own address space. We do this
 * early to reserve pid 1. Note special case - do not make it
 * runnable yet, init execution is started when userspace can be served.
 */
static void
create_init(const void *udata __unused)
{
    fr.fr_procp = &initproc;
    error = fork1(&thread0, &fr);
    cpu_fork_kthread_handler(FIRST_THREAD_IN_PROC(initproc), start_init, NULL);
}
SYSINIT(init, SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL);
Same again. We fork early so that init gets pid 1 and set start_init
from above as its entry point. But the new thread is not made runnable
yet, that only happens once the kthread system is ready.
/*
 * Make it runnable now.
 */
static void
kick_init(const void *udata __unused)
{
    td = FIRST_THREAD_IN_PROC(initproc);
    TD_SET_CAN_RUN(td);
    sched_add(td, SRQ_BORING);
}
SYSINIT(kickinit, SI_SUB_KTHREAD_INIT, SI_ORDER_MIDDLE, kick_init, NULL);
And here it is. Somewhere in the middle of the kthread subsystem,
we kick init into gear and let it run. And yes, that flag there
really is called SRQ_BORING.
Conclusion
At their core, both OpenBSD and FreeBSD follow the same outline
here. We create process0, wire it up with credentials and virtual
memory, initialize our internal accounting for processes, etc.
Not really surprising as they both come from the same lineage. Same bones, different meat-suit.
What I like in the OpenBSD one: super easy to follow and no
surprises. Just read the main function and follow it step by step
like a damn cooking recipe. Readable, auditable, goal achieved I'd say.
The FreeBSD version is a bit more all over the place. But, it is
more modular. A kld can use the same SYSINIT framework to run its
own initialization at load time, hooking into the same ordering
system. But then there's also the callbacks, etc. The goal
of extensibility and flexibility is achieved 100% here, with the
trade-off being readability.
Neither is better or worse. It comes down to what you need and having that choice is good. I love and run both systems with different use-cases.
Next up, … I don't know yet. Maybe some more kernel stuff, maybe
back to some binaries? Or maybe we take a look at PF's code.
Let me know over on Mastodon if there is a topic you'd be interested in. Or feedback. Or just yell at me for getting stuff wrong, the chances are very high I got a lot wrong :D