Daily Source Reading: pledge
pledge
Finally, not the boooooring stuff from /bin or /usr/bin!
(Joking of course, the bin stuff has been very interesting so far)
We will definitely go back to binaries at some point, but today is our
first time ever looking at some kernel stuff!
So far in our series, there is one thing they all had in common: every
OpenBSD binary that we have looked at so far has used pledge.
And for good reason. It's a very good and simple to use mechanism to
prevent accidental foot-shooting.
If you don't know what pledge is, I highly recommend reading the man
page and then coming back here.
So, let's dive in.
    /*
     * Ordered in blocks starting with least risky and most required.
     */
    const uint64_t pledge_syscalls[SYS_MAXSYSCALL] = {
        /*
         * Minimum required
         */
        [SYS_exit] = PLEDGE_ALWAYS,
        [SYS_kbind] = PLEDGE_ALWAYS,
        [SYS___get_tcb] = PLEDGE_ALWAYS,
        // ... [AND A LOT MORE OF THAT] ...

        /* "getting" information about self is considered safe */
        [SYS_getuid] = PLEDGE_STDIO,
        [SYS_geteuid] = PLEDGE_STDIO,
        [SYS_getresuid] = PLEDGE_STDIO,
        [SYS_getgid] = PLEDGE_STDIO,
        [SYS_getegid] = PLEDGE_STDIO,
        [SYS_getresgid] = PLEDGE_STDIO,
        [SYS_getgroups] = PLEDGE_STDIO,
        // ... [AND A LOT MORE OF THAT] ...

        [SYS_fork] = PLEDGE_PROC,
        [SYS_vfork] = PLEDGE_PROC,
        [SYS_setpgid] = PLEDGE_PROC,
        [SYS_setsid] = PLEDGE_PROC,
        [SYS_setrlimit] = PLEDGE_PROC | PLEDGE_ID,
        [SYS_getpriority] = PLEDGE_PROC | PLEDGE_ID,
        [SYS_setpriority] = PLEDGE_PROC | PLEDGE_ID,
        [SYS_setuid] = PLEDGE_ID,
        // ... [AND A LOT MORE OF THAT] ...

        [SYS_lstat] = PLEDGE_RPATH | PLEDGE_WPATH | PLEDGE_TMPPATH,
        // ... [AND A LOT MORE OF THAT] ...
This is basically a mapping of every syscall to its list (or
bitmask) of pledge promises that allow it. As we will see in the check
further down, holding any one of the listed promises is enough.
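To get a feeling for the table, here is a tiny userland sketch of the same idea. Every name and number in it (the `TOY_*` bits, the toy syscall numbers, `toy_required`) is made up for illustration and is not the kernel's. What it shows is a nice side effect of designated initializers: any slot you don't mention is zero, so a syscall that is missing from the table has no promise that could ever allow it.

```c
#include <stdint.h>

/* Hypothetical promise bits, for illustration only. */
#define TOY_ALWAYS UINT64_MAX            /* stand-in for "always allowed" */
#define TOY_STDIO  (UINT64_C(1) << 0)
#define TOY_RPATH  (UINT64_C(1) << 1)
#define TOY_WPATH  (UINT64_C(1) << 2)
#define TOY_PROC   (UINT64_C(1) << 3)

/* Hypothetical syscall numbers. */
enum {
    TOY_SYS_exit,
    TOY_SYS_getuid,
    TOY_SYS_fork,
    TOY_SYS_lstat,
    TOY_SYS_mount,
    TOY_SYS_MAX
};

static const uint64_t toy_syscalls[TOY_SYS_MAX] = {
    [TOY_SYS_exit]   = TOY_ALWAYS,
    [TOY_SYS_getuid] = TOY_STDIO,
    [TOY_SYS_fork]   = TOY_PROC,
    [TOY_SYS_lstat]  = TOY_RPATH | TOY_WPATH,
    /* TOY_SYS_mount is not listed, so its slot is implicitly 0 */
};

/* Return the mask of promises that allow a syscall, 0 if none do. */
uint64_t toy_required(int code)
{
    if (code < 0 || code >= TOY_SYS_MAX)
        return 0;
    return toy_syscalls[code];
}
```

With this table, `toy_required(TOY_SYS_lstat)` yields two bits and `toy_required(TOY_SYS_mount)` yields zero.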
    static const struct {
        char *name;
        uint64_t flags;
    } pledgereq[] = {
        { "audio", PLEDGE_AUDIO },
        { "bpf", PLEDGE_BPF },
        { "chown", PLEDGE_CHOWN | PLEDGE_CHOWNUID },
        { "cpath", PLEDGE_CPATH },
        { "disklabel", PLEDGE_DISKLABEL },
        { "dns", PLEDGE_DNS },
        { "dpath", PLEDGE_DPATH },
        { "drm", PLEDGE_DRM },
        { "error", PLEDGE_ERROR },
        { "exec", PLEDGE_EXEC },
        { "fattr", PLEDGE_FATTR | PLEDGE_CHOWN },
        { "flock", PLEDGE_FLOCK },
        { "getpw", PLEDGE_GETPW },
        { "id", PLEDGE_ID },
        { "inet", PLEDGE_INET },
        { "mcast", PLEDGE_MCAST },
        { "pf", PLEDGE_PF },
        { "proc", PLEDGE_PROC },
        { "prot_exec", PLEDGE_PROTEXEC },
        { "ps", PLEDGE_PS },
        { "recvfd", PLEDGE_RECVFD },
        { "route", PLEDGE_ROUTE },
        { "rpath", PLEDGE_RPATH },
        { "sendfd", PLEDGE_SENDFD },
        { "settime", PLEDGE_SETTIME },
        { "stdio", PLEDGE_STDIO },
        { "tape", PLEDGE_TAPE },
        { "tmppath", PLEDGE_TMPPATH },
        { "tty", PLEDGE_TTY },
        { "unix", PLEDGE_UNIX },
        { "unveil", PLEDGE_UNVEIL },
        { "video", PLEDGE_VIDEO },
        { "vminfo", PLEDGE_VMINFO },
        { "vmm", PLEDGE_VMM },
        { "wpath", PLEDGE_WPATH },
        { "wroute", PLEDGE_WROUTE },
    };
Here we have a simple array that maps the strings (yes, pledge takes
a long string that defines what it promises instead of bitmasks) to
their bitmask values. Not entirely sure why the pledge call doesn't
immediately use bitmasks, but I am very sure there are reasons. If
anyone knows, hit me up on Mastodon.
Note however, that the list is alphabetically sorted. And that's not just for aesthetics as we will see in a bit.
(Update 1: as has been pointed out to me on Mastodon, the original design was actually using bitmasks: tame(2))
(Update 2: here's an explanation on the WaybackMachine)
    int parsepledges(struct proc *p, const char *kname, const char *promises, u_int64_t *fp)
    {
        size_t rbuflen;
        char *rbuf, *rp, *pn;
        u_int64_t flags = 0, f;
        int error;

        rbuf = malloc(MAXPATHLEN, M_TEMP, M_WAITOK);
        error = copyinstr(promises, rbuf, MAXPATHLEN, &rbuflen);
        if (error) {
            free(rbuf, M_TEMP, MAXPATHLEN);
            return (error);
        }
    #ifdef KTRACE
        if (KTRPOINT(p, KTR_STRUCT))
            ktrstruct(p, kname, rbuf, rbuflen-1);
    #endif

        for (rp = rbuf; rp && *rp; rp = pn) {
            pn = strchr(rp, ' '); /* find terminator */
            if (pn) {
                while (*pn == ' ')
                    *pn++ = '\0';
            }
            if ((f = pledgereq_flags(rp)) == 0) {
                free(rbuf, M_TEMP, MAXPATHLEN);
                return (EINVAL);
            }
            flags |= f;
        }
        free(rbuf, M_TEMP, MAXPATHLEN);
        *fp = flags;
        return 0;
    }
This is the function that will split "stdio exec" into its
components and create the bitmask out of it. It splits on spaces (and
only spaces, see the strchr call), then retrieves the flags for each
promise via pledgereq_flags:
    /* bsearch over pledgereq. return flags value if found, 0 else */
    uint64_t pledgereq_flags(const char *req_name)
    {
        int base = 0, cmp, i, lim;

        for (lim = nitems(pledgereq); lim != 0; lim >>= 1) {
            i = base + (lim >> 1);
            cmp = strcmp(req_name, pledgereq[i].name);
            if (cmp == 0)
                return (pledgereq[i].flags);
            if (cmp > 0) { /* not found before, move right */
                base = i + 1;
                lim--;
            } /* else move left */
        }
        return (0);
    }
And there's the reason the list was alphabetically sorted. So we can run a binary search on it. The list isn't too long, so it wouldn't be that expensive to run a linear search, but good to have.
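If you want to play with the parsing outside the kernel, here is a rough userland sketch of the same two steps: split on spaces, then binary-search a sorted name table. The table is a made-up five-entry subset with made-up bit values, and `toy_req_flags` and `toy_parse` are hypothetical names. The binary search is written as a plain lo/hi loop, not the base/lim variant the kernel uses, and the fixed 128-byte buffer stands in for the kernel's copyinstr into a MAXPATHLEN buffer.

```c
#include <stdint.h>
#include <string.h>

struct toy_req {
    const char *name;
    uint64_t flags;
};

/* Hypothetical subset; must stay sorted by name for the binary search. */
static const struct toy_req toy_reqs[] = {
    { "exec",  UINT64_C(1) << 0 },
    { "inet",  UINT64_C(1) << 1 },
    { "rpath", UINT64_C(1) << 2 },
    { "stdio", UINT64_C(1) << 3 },
    { "wpath", UINT64_C(1) << 4 },
};

/* Binary search by name. Returns the flags, or 0 if the name is unknown. */
static uint64_t toy_req_flags(const char *name)
{
    size_t lo = 0, hi = sizeof(toy_reqs) / sizeof(toy_reqs[0]);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(name, toy_reqs[mid].name);

        if (cmp == 0)
            return toy_reqs[mid].flags;
        if (cmp > 0)
            lo = mid + 1;   /* name sorts after mid: search right half */
        else
            hi = mid;       /* name sorts before mid: search left half */
    }
    return 0;
}

/*
 * Parse a space-separated promise string into a bitmask.
 * Returns 0 on success, -1 on an unknown promise or over-long input.
 */
int toy_parse(const char *promises, uint64_t *out)
{
    char buf[128];
    uint64_t flags = 0;
    size_t len = strlen(promises);

    if (len >= sizeof(buf))
        return -1;
    memcpy(buf, promises, len + 1);     /* work on a private copy */

    char *rp = buf;
    while (*rp != '\0') {
        char *pn = strchr(rp, ' ');     /* find end of this word */
        if (pn != NULL) {
            while (*pn == ' ')          /* terminate word, skip extra spaces */
                *pn++ = '\0';
        }
        uint64_t f = toy_req_flags(rp);
        if (f == 0)
            return -1;
        flags |= f;
        if (pn == NULL)                 /* that was the last word */
            break;
        rp = pn;
    }
    *out = flags;
    return 0;
}
```

Calling `toy_parse("stdio rpath", &mask)` ORs the two toy bits together, while `toy_parse("stdio bogus", &mask)` fails. Note that 0 doubles as the "not found" value, which only works because no real entry has an empty mask.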
The nitty gritty of promising something
First we need to promise that we will behave. For that, OpenBSD has
the pledge syscall (or sys_pledge as its implementation is named).
So, anytime you for example run pledge("stdio rpath", NULL), this is
where we will end up.
    int sys_pledge(struct proc *p, void *v, register_t *retval)
    {
        struct sys_pledge_args /* {
            syscallarg(const char *)promises;
            syscallarg(const char *)execpromises;
        } */ *uap = v;
        struct process *pr = p->p_p;
        uint64_t promises, execpromises;
        int error = 0;
        int unveil_cleanup = 0;

        /* Check for any error in user input */
        if (SCARG(uap, promises)) {
            error = parsepledges(p, "pledgereq",
                SCARG(uap, promises), &promises);
            if (error)
                return (error);
        }
        if (SCARG(uap, execpromises)) {
            error = parsepledges(p, "pledgeexecreq",
                SCARG(uap, execpromises), &execpromises);
            if (error)
                return (error);
        }

        mtx_enter(&pr->ps_mtx);
To promise something, we first do some house-cleaning and check if the
promises (e.g. "stdio exec rpath" etc.) are correct (just checking
for parsing here).
        /* Check for any error wrt current promises */
        if (SCARG(uap, promises)) {
            /* In "error" mode, ignore promise increase requests,
             * but accept promise decrease requests */
            if (ISSET(pr->ps_flags, PS_PLEDGE) &&
                (pr->ps_pledge & PLEDGE_ERROR))
                promises &= (pr->ps_pledge & PLEDGE_USERSET);

            /* Only permit reductions */
            if (ISSET(pr->ps_flags, PS_PLEDGE) &&
                (((promises | pr->ps_pledge) != pr->ps_pledge))) {
                error = EPERM;
                goto fail;
            }
        }
        if (SCARG(uap, execpromises)) {
            /* Only permit reductions */
            if (ISSET(pr->ps_flags, PS_EXECPLEDGE) &&
                (((execpromises | pr->ps_execpledge) != pr->ps_execpledge))) {
                error = EPERM;
                goto fail;
            }
        }
Here, we check if only reductions are applied. If a process pledges
to only allow stdio, it cannot later pledge stdio exec.
If it's a promise and not an execpromise (if you don't know
the difference, I'll cheekily refer you to the man page. I won't
hold your hand through everything.), then we also add a quick check if
the error promise is set. In that case the bits we don't already hold
are masked out first, so an attempted increase is silently ignored
instead of failing with EPERM.
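The reduction check is a neat bit trick, so here is a small userland sketch of it. `is_reduction` and `toy_repledge` are hypothetical names, and the sketch simplifies two things: it assumes the process has already pledged once (the kernel only enforces this when PS_PLEDGE is set), and it masks against the whole current set in error mode where the kernel additionally applies PLEDGE_USERSET.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * A request is a reduction if it adds no bit that is not already held:
 * OR-ing it into the current mask must leave the current mask unchanged.
 */
bool is_reduction(uint64_t current, uint64_t requested)
{
    return (requested | current) == current;
}

/*
 * Apply a new promise mask to an already pledged process.
 * Returns 0 on success, -1 where the kernel would return EPERM.
 */
int toy_repledge(uint64_t *current, uint64_t requested, bool error_mode)
{
    if (error_mode)
        requested &= *current;  /* quietly drop bits we do not hold */

    if (!is_reduction(*current, requested))
        return -1;

    *current = requested;
    return 0;
}
```

Starting from a mask of `0x3`, dropping to `0x1` works, asking for `0x3` again fails, and in error mode the same request succeeds but leaves the mask at `0x1`.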
        /* Set up promises */
        if (SCARG(uap, promises)) {
            pr->ps_pledge = promises;
            atomic_setbits_int(&pr->ps_flags, PS_PLEDGE);

            if ((pr->ps_pledge & (PLEDGE_RPATH | PLEDGE_WPATH |
                PLEDGE_CPATH | PLEDGE_DPATH | PLEDGE_TMPPATH | PLEDGE_EXEC |
                PLEDGE_UNIX | PLEDGE_UNVEIL)) == 0)
                unveil_cleanup = 1;
        }
        if (SCARG(uap, execpromises)) {
            pr->ps_execpledge = execpromises;
            atomic_setbits_int(&pr->ps_flags, PS_EXECPLEDGE);
        }
We have reached the core of it: we set the process' pledge mask to our promises and turn on the flag for the process that promises are set.
Once none of our resulting promises (after reducing them) have
anything left that accesses paths (like rpath, wpath or cpath),
we need to make sure that we clean up any unveil related work, as we
don't need it anymore (we might look at unveil soon, so just be
patient, we will get around to it). Don't dilly-dally until sometime
later, clean it up now.
    fail:
        mtx_leave(&pr->ps_mtx);

        if (unveil_cleanup) {
            /*
             * Kill off unveil and drop unveil vnode refs if we no
             * longer are holding any path-accessing pledge. This
             * must be done single-threaded, because another thread
             * may be in a system call sleeping in namei().
             */
            single_thread_set(p, SINGLE_UNWIND);
            KERNEL_LOCK();
            unveil_destroy(pr);
            KERNEL_UNLOCK();
            single_thread_clear(p);
        }

        return (error);
    }
Lastly, we do any of the aforementioned unveil cleanup if needed and
return.
Making sure promises are kept
Now, making promises is easy. Keeping them is harder. At least
that's how the saying goes. Making sure that the promises to pledge
are kept is actually not too bad in its implementation, and less scary
than one would think.
    int pledge_syscall(struct proc *p, int code, uint64_t *tval)
    {
        p->p_pledge_syscall = code;
        *tval = 0;

        if (code < 0 || code > SYS_MAXSYSCALL - 1)
            return (EINVAL);

        if (pledge_syscalls[code] == PLEDGE_ALWAYS)
            return (0);

        p->p_pledge = READ_ONCE(p->p_p->ps_pledge); /* pledge checks are per-thread */
        if (p->p_pledge & pledge_syscalls[code])
            return (0);

        *tval = pledge_syscalls[code];
        return (EPERM);
    }
This call will run a lot, so the shorter, the better.
Basically we check 3 things:
- First we check if the syscall code is actually within the allowed range
- Then we short-circuit if the syscall in question is of type
  PLEDGE_ALWAYS, so we don't need to check if specific bits are set
- Then, we check with p->p_pledge & pledge_syscalls[code] if the two
  bitmasks overlap, i.e. if we hold at least one promise that allows
  this syscall
At the end, if we haven't early exited yet, something went wrong and
the syscall we attempted was not allowed. We then store the pledge
mask for that syscall code in tval. We will come to that in a
moment.
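The three checks fit into a few lines of userland C, too. This is only a sketch with invented names and numbers (`DEMO_*`, `demo_check`), and it takes the pledged mask as a plain argument instead of reading it from a process structure.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical promise bits and syscall numbers, for illustration only. */
#define DEMO_ALWAYS UINT64_MAX
#define DEMO_STDIO  (UINT64_C(1) << 0)
#define DEMO_RPATH  (UINT64_C(1) << 1)
#define DEMO_WPATH  (UINT64_C(1) << 2)

enum { DEMO_SYS_exit, DEMO_SYS_getuid, DEMO_SYS_lstat, DEMO_SYS_MAX };

static const uint64_t demo_syscalls[DEMO_SYS_MAX] = {
    [DEMO_SYS_exit]   = DEMO_ALWAYS,
    [DEMO_SYS_getuid] = DEMO_STDIO,
    [DEMO_SYS_lstat]  = DEMO_RPATH | DEMO_WPATH,
};

/*
 * Returns 0 if the syscall is allowed, EINVAL for a bad syscall number,
 * and EPERM otherwise. On EPERM, *missing holds the promises that would
 * have allowed the call.
 */
int demo_check(uint64_t pledged, int code, uint64_t *missing)
{
    *missing = 0;

    if (code < 0 || code >= DEMO_SYS_MAX)       /* 1: range check */
        return EINVAL;

    if (demo_syscalls[code] == DEMO_ALWAYS)     /* 2: always allowed */
        return 0;

    if (pledged & demo_syscalls[code])          /* 3: any shared bit is enough */
        return 0;

    *missing = demo_syscalls[code];
    return EPERM;
}
```

A process that only holds `DEMO_WPATH` may call the toy lstat, because one overlapping bit is sufficient. A process that only holds `DEMO_STDIO` gets EPERM, and `missing` comes back as `DEMO_RPATH | DEMO_WPATH`.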
But we skipped an important line:
p->p_pledge = READ_ONCE(p->p_p->ps_pledge); /* pledge checks are per-thread */
As this is kernel code, we have to pay a bit more attention. Let's dissect it a bit more carefully then:
- p is our kernel thread, and it has a cache for the pledge mask
- p->p_p is the owning (not parent!) process for the thread
- p->p_p->ps_pledge is the process' pledge mask
We read it once and then cache it in the kernel thread. We do this,
because another thread could potentially alter our mask concurrently.
That would be bad. This is because pledge_syscall is not the only
place this mask gets checked.
There are more refined checks in the source, like pledge_namei or
pledge_ioctl. They all first have to pass the pledge_syscall
check and then can move on. Without that snapshot of the mask, we
would check with a somehow inconsistent set of promises.
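To make the snapshot idea concrete, here is a userland sketch. It uses C11 atomics as a stand-in for READ_ONCE, which is my assumption and not what the kernel macro does internally, and all the `toy_*` names are invented. The point is only that the later checks look at the thread's copy, not at the shared mask.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Shared by all threads of a process; may change concurrently. */
struct toy_process {
    _Atomic uint64_t pledge;
};

/* Per-thread state with a private copy of the mask. */
struct toy_thread {
    struct toy_process *proc;
    uint64_t pledge_snapshot;
};

/* At syscall entry: read the shared mask exactly once and cache it. */
void toy_syscall_enter(struct toy_thread *t)
{
    t->pledge_snapshot = atomic_load(&t->proc->pledge);
}

/* Later, finer-grained checks consult the snapshot, not the shared mask. */
bool toy_later_check(const struct toy_thread *t, uint64_t needed)
{
    return (t->pledge_snapshot & needed) != 0;
}
```

If another thread reduces the shared mask after `toy_syscall_enter`, the running syscall still sees the mask it started with. The reduced mask takes effect at the next syscall entry.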
We won't dive into the implementation of pledge_namei and the
others, that would be too much for a blog post.
When promises are broken
So what happens when a process breaks one of its promises? That's
where pledge_fail comes in. This is the last function we should
take a look at before we wrap it up.
    int pledge_fail(struct proc *p, int error, uint64_t code)
    {
        const char *codes = "";
        int i;

        /* Print first matching pledge */
        for (i = 0; code && pledgenames[i].bits != 0; i++)
            if (pledgenames[i].bits & code) {
                codes = pledgenames[i].name;
                break;
            }
First, we figure out which promise was violated. Remember tval from
pledge_syscall? That's where we saved the bitmask of promises that
would have been required (remember that long array mapping syscall
to bitmasks in the beginning?). We loop over the pledgenames array
to find the first name that matches, so we have something that we can
print that is understandable and not just a weird looking bitmask.
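The name lookup is simple enough to sketch in userland. The table below is a made-up three-entry stand-in for pledgenames (its order and bit values are my invention), terminated by a zero entry like the loop above expects, and `toy_first_name` is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_name {
    uint64_t bits;
    const char *name;
};

/* Hypothetical stand-in for pledgenames, terminated by a zero entry. */
static const struct toy_name toy_names[] = {
    { UINT64_C(1) << 0, "rpath" },
    { UINT64_C(1) << 1, "wpath" },
    { UINT64_C(1) << 2, "stdio" },
    { 0, NULL },
};

/* Return the name of the first table entry that shares a bit with code. */
const char *toy_first_name(uint64_t code)
{
    const char *codes = "";

    for (size_t i = 0; code && toy_names[i].bits != 0; i++) {
        if (toy_names[i].bits & code) {
            codes = toy_names[i].name;
            break;
        }
    }
    return codes;
}
```

For a mask with both the toy rpath and wpath bits set, only "rpath" comes back, because the loop stops at the first hit. That is why the diagnostic names a single promise even when several different ones would have allowed the call.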
    #ifdef KTRACE
        if (KTRPOINT(p, KTR_PLEDGE))
            ktrpledge(p, error, code, p->p_pledge_syscall);
    #endif
        if (p->p_pledge & PLEDGE_ERROR)
            return (ENOSYS);
If ktrace is active, we log the violation for debugging.
Then, the error promise gets its moment. If the process pledged
"error", we just return ENOSYS and let the process live. No
drama, it shall live for another day (again, man pledge is your best
friend here). The process can handle the error and move on.
If "error" is not set though, things get serious:
        KERNEL_LOCK();
        uprintf("%s[%d]: pledge \"%s\", syscall %d\n",
            p->p_p->ps_comm, p->p_p->ps_pid, codes, p->p_pledge_syscall);
        p->p_p->ps_acflag |= APLEDGE;
First we print a diagnostic to the console. If you ever did something
that you didn't pledge in your OpenBSD code, the system will stop
you hard in your tracks.
Quick sideshow:
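    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Promise only stdio; opening a file for reading needs "rpath". */
        pledge("stdio", NULL);
        /* Not covered by the promise, so the kernel kills us right here. */
        open("/etc/passwd", O_RDONLY);
        return 0;
    }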
    rootnode /tmp $ cc fail.c -o fail
    rootnode /tmp $ ./fail
    fail[67014]: pledge "rpath", syscall 5
    Abort trap (core dumped)
    (134) rootnode /tmp $
Very helpful for figuring out what you forgot to promise, but the kernel does very quickly shoot you in the face instead of allowing anything that you didn't promise.
And if accounting is enabled, we can even see our failed test program
in the output of lastcomm:
    ksh[78314]    -F    rootnode ttyp1 0.00 secs Fri Feb 13 01:03 (0:00:00.00)
    printf[51819] -     rootnode ttyp1 0.00 secs Fri Feb 13 01:03 (0:00:00.00)
    fail[32547]   -DXP  rootnode ttyp1 0.00 secs Fri Feb 13 01:03 (0:00:00.00)
    ksh[26780]    -F    rootnode ttyp1 0.00 secs Fri Feb 13 01:03 (0:00:00.02)
That is thanks to that accounting flag mentioned above and confirms what we read in the man page.
Ok, back to the rest:
I guess ps_acflag holds some accounting flags, just to keep track of
some info about processes. Reading the man page, APLEDGE should be for
this case:

    a process that was terminated due to a pledge violation is accounted
    by lastcomm(1) with the `P' flag.

So the accounting flag APLEDGE is set right after the diagnostic is
printed, and before the process is torn down:
        /* Try to stop threads immediately, because this process is suspect */
        if (P_HASSIBLING(p))
            single_thread_set(p, SINGLE_UNWIND | SINGLE_DEEP);

        /* Send uncatchable SIGABRT for coredump */
        sigabort(p);

        p->p_p->ps_pledge = 0; /* Disable all PLEDGE_ flags */
        KERNEL_UNLOCK();
        return (error);
    }
If the process has other threads, we try to stop them immediately. The process is suspect at this point, as the comment says. No point in letting sibling threads keep running while the kernel is about to shoot the entire process in the face.
Finally, the gun goes off with sigabort and we clear the pledge
mask. As the comment says, that disables all PLEDGE_ flags, so the
dying process holds no promises at all anymore.
Conclusion
For such an extremely powerful tool, the implementation is astonishingly small and lightweight (compared to a lot of other kernel code).
It's not just that pledge is easy to use, its implementation
is also easy to follow. Honestly, every system should have something
this simple.
I know that other systems have similar tools (seccomp or AppArmor
on Linux, capsicum on FreeBSD). But seccomp makes you write your
filter as a BPF program (classic BPF, not eBPF) just to say which
syscalls are allowed…*WHY!?* And capsicum, although nice,
requires a bit more restructuring of existing code, as it is file
descriptor based and the descriptors all have to be set up before
calling cap_enter. It works fabulously if integrated into the code from the
start, but difficult to add to existing code. (At some point in the
future, we might actually go take a look at capsicum, but I fear the
code for that might be a bit longer.)
pledge on the other hand…very easy to throw into existing code.
Your code is only supposed to read files and maybe open a socket?
Cool, pledge that and keep "exec" out of it, and it's already a
lot more difficult to exploit your program and execute a shell.
That concludes today's code reading. For the next one, I think it
makes a lot of sense if we take a look at pledge's partner in crime:
unveil.
And as usual: always happy to hear any feedback or suggestions over on Mastodon.