Daily Source Reading: `pledge`

`pledge`

Finally, not the boooooring stuff from /bin or /usr/bin! (Joking of course, the bin stuff has been very interesting so far) We will definitely go back to binaries at some point, but today is our first time ever looking at some kernel stuff!

One thing has been constant in our series so far: all OpenBSD binaries that we have looked at have used pledge. And for good reason. It's a very good and simple-to-use mechanism to prevent accidental foot-shooting.

If you don't know what pledge is, I highly recommend reading the man page and then coming back here.

So, let's dive in.

/*
 * Ordered in blocks starting with least risky and most required.
 */
const uint64_t pledge_syscalls[SYS_MAXSYSCALL] = {
    /*
     * Minimum required
     */
    [SYS_exit] = PLEDGE_ALWAYS,
    [SYS_kbind] = PLEDGE_ALWAYS,
    [SYS___get_tcb] = PLEDGE_ALWAYS,
    // ... [AND A LOT MORE OF THAT] ...
    /* "getting" information about self is considered safe */
    [SYS_getuid] = PLEDGE_STDIO,
    [SYS_geteuid] = PLEDGE_STDIO,
    [SYS_getresuid] = PLEDGE_STDIO,
    [SYS_getgid] = PLEDGE_STDIO,
    [SYS_getegid] = PLEDGE_STDIO,
    [SYS_getresgid] = PLEDGE_STDIO,
    [SYS_getgroups] = PLEDGE_STDIO,
    // ... [AND A LOT MORE OF THAT] ...
    [SYS_fork] = PLEDGE_PROC,
    [SYS_vfork] = PLEDGE_PROC,
    [SYS_setpgid] = PLEDGE_PROC,
    [SYS_setsid] = PLEDGE_PROC,

    [SYS_setrlimit] = PLEDGE_PROC | PLEDGE_ID,
    [SYS_getpriority] = PLEDGE_PROC | PLEDGE_ID,

    [SYS_setpriority] = PLEDGE_PROC | PLEDGE_ID,

    [SYS_setuid] = PLEDGE_ID,
    // ... [AND A LOT MORE OF THAT] ...
    [SYS_lstat] = PLEDGE_RPATH | PLEDGE_WPATH | PLEDGE_TMPPATH,
    // ... [AND A LOT MORE OF THAT] ...

This is basically a mapping of every syscall to its bitmask of pledge promises. Holding any one of the listed promises is enough to get past the initial check that we will see later.
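To make the table idea a bit more tangible, here is a tiny userland sketch of the same pattern. Everything prefixed with demo_ or DEMO_ is made up for illustration (the syscall numbers and bit values are not the kernel's). The interesting detail is that designated initializers leave every unlisted slot at 0, so a syscall that is not in the table has no promise that could ever match it.

```c
#include <stdint.h>

/* Hypothetical syscall numbers and promise bits, for illustration only. */
enum { DEMO_SYS_exit, DEMO_SYS_getuid, DEMO_SYS_lstat, DEMO_SYS_mount,
    DEMO_SYS_MAX };

#define DEMO_ALWAYS  UINT64_MAX
#define DEMO_STDIO   (UINT64_C(1) << 0)
#define DEMO_RPATH   (UINT64_C(1) << 1)
#define DEMO_WPATH   (UINT64_C(1) << 2)

/*
 * Index = syscall number, value = promises that allow the call.
 * DEMO_SYS_mount is not listed, so its entry is implicitly 0.
 */
static const uint64_t demo_syscalls[DEMO_SYS_MAX] = {
    [DEMO_SYS_exit]   = DEMO_ALWAYS,
    [DEMO_SYS_getuid] = DEMO_STDIO,
    [DEMO_SYS_lstat]  = DEMO_RPATH | DEMO_WPATH,
};

/* Return the promise mask for a syscall, or 0 for out-of-range numbers. */
static uint64_t
demo_required(int code)
{
    if (code < 0 || code >= DEMO_SYS_MAX)
        return 0;
    return demo_syscalls[code];
}
```

Calling demo_required(DEMO_SYS_mount) gives 0 in this sketch, simply because the slot was never initialized.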

static const struct {
    char *name;
    uint64_t flags;
} pledgereq[] = {
    { "audio",      PLEDGE_AUDIO },
    { "bpf",        PLEDGE_BPF },
    { "chown",      PLEDGE_CHOWN | PLEDGE_CHOWNUID },
    { "cpath",      PLEDGE_CPATH },
    { "disklabel",      PLEDGE_DISKLABEL },
    { "dns",        PLEDGE_DNS },
    { "dpath",      PLEDGE_DPATH },
    { "drm",        PLEDGE_DRM },
    { "error",      PLEDGE_ERROR },
    { "exec",       PLEDGE_EXEC },
    { "fattr",      PLEDGE_FATTR | PLEDGE_CHOWN },
    { "flock",      PLEDGE_FLOCK },
    { "getpw",      PLEDGE_GETPW },
    { "id",         PLEDGE_ID },
    { "inet",       PLEDGE_INET },
    { "mcast",      PLEDGE_MCAST },
    { "pf",         PLEDGE_PF },
    { "proc",       PLEDGE_PROC },
    { "prot_exec",      PLEDGE_PROTEXEC },
    { "ps",         PLEDGE_PS },
    { "recvfd",     PLEDGE_RECVFD },
    { "route",      PLEDGE_ROUTE },
    { "rpath",      PLEDGE_RPATH },
    { "sendfd",     PLEDGE_SENDFD },
    { "settime",        PLEDGE_SETTIME },
    { "stdio",      PLEDGE_STDIO },
    { "tape",       PLEDGE_TAPE },
    { "tmppath",        PLEDGE_TMPPATH },
    { "tty",        PLEDGE_TTY },
    { "unix",       PLEDGE_UNIX },
    { "unveil",     PLEDGE_UNVEIL },
    { "video",      PLEDGE_VIDEO },
    { "vminfo",     PLEDGE_VMINFO },
    { "vmm",        PLEDGE_VMM },
    { "wpath",      PLEDGE_WPATH },
    { "wroute",     PLEDGE_WROUTE },
};

Here we have a simple array that maps the strings (yes, pledge takes a long string that defines what it promises instead of bitmasks) to their bitmask values. Not entirely sure why the pledge call doesn't immediately use bitmasks, but I am very sure there are reasons. If anyone knows, hit me up on Mastodon.

Note however, that the list is alphabetically sorted. And that's not just for aesthetics as we will see in a bit.

(Update 1: as has been pointed out to me on Mastodon, the original design was actually using bitmasks: tame(2))

(Update 2: here's an explanation on the WaybackMachine)

int
parsepledges(struct proc *p, const char *kname, const char *promises, u_int64_t *fp)
{
    size_t rbuflen;
    char *rbuf, *rp, *pn;
    u_int64_t flags = 0, f;
    int error;

    rbuf = malloc(MAXPATHLEN, M_TEMP, M_WAITOK);
    error = copyinstr(promises, rbuf, MAXPATHLEN, &rbuflen);
    if (error) {
        free(rbuf, M_TEMP, MAXPATHLEN);
        return (error);
    }
#ifdef KTRACE
    if (KTRPOINT(p, KTR_STRUCT))
        ktrstruct(p, kname, rbuf, rbuflen-1);
#endif

    for (rp = rbuf; rp && *rp; rp = pn) {
        pn = strchr(rp, ' ');   /* find terminator */
        if (pn) {
            while (*pn == ' ')
                *pn++ = '\0';
        }
        if ((f = pledgereq_flags(rp)) == 0) {
            free(rbuf, M_TEMP, MAXPATHLEN);
            return (EINVAL);
        }
        flags |= f;
    }
    free(rbuf, M_TEMP, MAXPATHLEN);
    *fp = flags;
    return 0;
}

This is the function that will split "stdio exec" into its components and create the bitmask out of it. It splits on spaces (a run of spaces counts as one separator), then retrieves the flags for each promise via pledgereq_flags:

/* bsearch over pledgereq. return flags value if found, 0 else */
uint64_t
pledgereq_flags(const char *req_name)
{
    int base = 0, cmp, i, lim;

    for (lim = nitems(pledgereq); lim != 0; lim >>= 1) {
        i = base + (lim >> 1);
        cmp = strcmp(req_name, pledgereq[i].name);
        if (cmp == 0)
            return (pledgereq[i].flags);
        if (cmp > 0) { /* not found before, move right */
            base = i + 1;
            lim--;
        } /* else move left */
    }
    return (0);
}

And there's the reason the list was alphabetically sorted. So we can run a binary search on it. The list isn't too long, so it wouldn't be that expensive to run a linear search, but good to have.
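If you want to play with the split-and-lookup idea outside the kernel, here is a rough userland sketch. It is not the kernel code: I swapped the hand-rolled binary search for the standard library's bsearch, the table is a small made-up subset with made-up bit values, and demo_flags and demo_parse are hypothetical names.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct demo_req { const char *name; uint64_t flags; };

/* Hypothetical subset with made-up bits; must stay sorted for bsearch. */
static const struct demo_req demo_reqs[] = {
    { "exec",  UINT64_C(1) << 0 },
    { "inet",  UINT64_C(1) << 1 },
    { "rpath", UINT64_C(1) << 2 },
    { "stdio", UINT64_C(1) << 3 },
    { "wpath", UINT64_C(1) << 4 },
};

static int
demo_cmp(const void *key, const void *elem)
{
    return strcmp(key, ((const struct demo_req *)elem)->name);
}

/* Look up one promise name; 0 means "unknown". */
static uint64_t
demo_flags(const char *name)
{
    const struct demo_req *r = bsearch(name, demo_reqs,
        sizeof(demo_reqs) / sizeof(demo_reqs[0]), sizeof(demo_reqs[0]),
        demo_cmp);

    return r ? r->flags : 0;
}

/* Split on spaces and OR the flags together; -1 on an unknown name. */
static int
demo_parse(const char *promises, uint64_t *out)
{
    char buf[128], *rp, *pn;
    uint64_t flags = 0, f;

    if (strlen(promises) >= sizeof(buf))
        return -1;
    strcpy(buf, promises);
    for (rp = buf; rp && *rp; rp = pn) {
        pn = strchr(rp, ' ');       /* find terminator */
        if (pn)
            while (*pn == ' ')      /* swallow a run of spaces */
                *pn++ = '\0';
        if ((f = demo_flags(rp)) == 0)
            return -1;
        flags |= f;
    }
    *out = flags;
    return 0;
}
```

With this table, demo_parse("stdio rpath", &mask) succeeds and sets the two corresponding bits, while a string containing an unknown word is rejected as a whole, just like the EINVAL path above.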

The nitty gritty of promising something

First we need to promise that we will behave. For that, OpenBSD has the pledge syscall (or sys_pledge as its implementation is named).

So, anytime you for example run pledge("stdio rpath", NULL), this is where we will end up.

int
sys_pledge(struct proc *p, void *v, register_t *retval)
{
    struct sys_pledge_args /* {
        syscallarg(const char *)promises;
        syscallarg(const char *)execpromises;
    } */    *uap = v;
    struct process *pr = p->p_p;
    uint64_t promises, execpromises;
    int error = 0;
    int unveil_cleanup = 0;

    /* Check for any error in user input */
    if (SCARG(uap, promises)) {
        error = parsepledges(p, "pledgereq",
            SCARG(uap, promises), &promises);
        if (error)
            return (error);
    }
    if (SCARG(uap, execpromises)) {
        error = parsepledges(p, "pledgeexecreq",
            SCARG(uap, execpromises), &execpromises);
        if (error)
            return (error);
    }

    mtx_enter(&pr->ps_mtx);

To promise something, we first do some house-cleaning and check if the promises (e.g. "stdio exec rpath" etc.) are correct (just checking for parsing here).

/* Check for any error wrt current promises */
if (SCARG(uap, promises)) {
    /* In "error" mode, ignore promise increase requests,
     * but accept promise decrease requests */
    if (ISSET(pr->ps_flags, PS_PLEDGE) &&
        (pr->ps_pledge & PLEDGE_ERROR))
        promises &= (pr->ps_pledge & PLEDGE_USERSET);

    /* Only permit reductions */
    if (ISSET(pr->ps_flags, PS_PLEDGE) &&
        (((promises | pr->ps_pledge) != pr->ps_pledge))) {
        error = EPERM;
        goto fail;
    }
}
if (SCARG(uap, execpromises)) {
    /* Only permit reductions */
    if (ISSET(pr->ps_flags, PS_EXECPLEDGE) &&
        (((execpromises | pr->ps_execpledge) != pr->ps_execpledge))) {
        error = EPERM;
        goto fail;
    }
}

Here, we check if only reductions are applied. If a process pledges to only allow stdio, it cannot later pledge stdio exec.

If it's a promise and not an execpromise (if you don't know the difference, I'll cheekily refer you to the man page. I won't handhold through everything.), then we also do a quick check whether the error promise is set, in which case we silently ignore an attempted increase instead of failing.
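The bit trick behind "only permit reductions" is worth a second look. Here is a small sketch of it, assuming the process has already pledged once. The bit values and the name demo_repledge are made up, and I left out the PLEDGE_USERSET masking that the real code applies in error mode.

```c
#include <stdint.h>

/* Hypothetical promise bits, for illustration only. */
#define DEMO_STDIO  (UINT64_C(1) << 0)
#define DEMO_RPATH  (UINT64_C(1) << 1)
#define DEMO_EXEC   (UINT64_C(1) << 2)
#define DEMO_ERROR  (UINT64_C(1) << 3)

/*
 * Decide what a repeated pledge call may set. `current` is the mask the
 * process already holds, `wanted` is the newly requested one.
 * Returns 0 and stores the new mask, or -1 (think EPERM) on an increase.
 */
static int
demo_repledge(uint64_t current, uint64_t wanted, uint64_t *out)
{
    /* "error" mode: silently drop bits that are not already held. */
    if (current & DEMO_ERROR)
        wanted &= current;

    /* Any bit in `wanted` that is missing from `current` is an increase. */
    if ((wanted | current) != current)
        return -1;

    *out = wanted;
    return 0;
}
```

ORing the request into the current mask changes nothing if the request is a subset, so a single comparison is enough to spot an attempted increase.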

/* Set up promises */
if (SCARG(uap, promises)) {
    pr->ps_pledge = promises;
    atomic_setbits_int(&pr->ps_flags, PS_PLEDGE);

    if ((pr->ps_pledge & (PLEDGE_RPATH | PLEDGE_WPATH |
        PLEDGE_CPATH | PLEDGE_DPATH | PLEDGE_TMPPATH | PLEDGE_EXEC |
        PLEDGE_UNIX | PLEDGE_UNVEIL)) == 0)
        unveil_cleanup = 1;
}
if (SCARG(uap, execpromises)) {
    pr->ps_execpledge = execpromises;
    atomic_setbits_int(&pr->ps_flags, PS_EXECPLEDGE);
}

We have reached the core of it: we set the process' pledge mask to our promises and turn on the flag for the process that promises are set.

Once none of our resulting promises (after reducing them) have anything left that accesses paths (like rpath, wpath or cpath), we need to make sure that we clean up any unveil related work, as we don't need it anymore (we might look at unveil soon, so just be patient, we will get around to it). Don't dilly-dally until sometime later, clean it up now.

fail:
    mtx_leave(&pr->ps_mtx);

    if (unveil_cleanup) {
        /*
         * Kill off unveil and drop unveil vnode refs if we no
         * longer are holding any path-accessing pledge. This
         * must be done single-threaded, because another thread
         * may be in a system call sleeping in namei().
         */
        single_thread_set(p, SINGLE_UNWIND);
        KERNEL_LOCK();
        unveil_destroy(pr);
        KERNEL_UNLOCK();
        single_thread_clear(p);
    }
    return (error);
}

Lastly, we do any of the aforementioned unveil cleanup if needed and return.

Making sure promises are kept

Now, making promises is easy. Keeping them is harder. At least that's how the saying goes. Making sure that the promises made to pledge are kept is actually not too bad in its implementation and less scary than one would think.

int
pledge_syscall(struct proc *p, int code, uint64_t *tval)
{
    p->p_pledge_syscall = code;
    *tval = 0;

    if (code < 0 || code > SYS_MAXSYSCALL - 1)
        return (EINVAL);

    if (pledge_syscalls[code] == PLEDGE_ALWAYS)
        return (0);

    p->p_pledge = READ_ONCE(p->p_p->ps_pledge); /* pledge checks are per-thread */
    if (p->p_pledge & pledge_syscalls[code])
        return (0);

    *tval = pledge_syscalls[code];
    return (EPERM);
}

This call will run a lot, so the shorter, the better.

Basically we check 3 things:

First we check if the syscall code is actually within the allowed range
Then we short-circuit if the syscall in question is of type PLEDGE_ALWAYS, so we don't need to check if specific bits are set
Then, we check with p->p_pledge & pledge_syscalls[code] whether the promises we hold and the promises that allow this syscall share at least one bit

At the end, if we haven't early exited yet, something went wrong and the syscall we attempted was not allowed. We then store the pledge mask for that syscall code in tval. We will come to that in a moment.
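The three checks are easy to replay in userland. The following is a sketch under assumptions, not the kernel function: the table, the syscall numbers and the bit values are invented, and demo_check is a hypothetical name. It returns the same errno values as the code above and hands back the mask that would have been needed.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical promise bits and syscall numbers, for illustration only. */
#define DEMO_ALWAYS UINT64_MAX
#define DEMO_STDIO  (UINT64_C(1) << 0)
#define DEMO_RPATH  (UINT64_C(1) << 1)
#define DEMO_PROC   (UINT64_C(1) << 2)

enum { DEMO_exit, DEMO_getuid, DEMO_open, DEMO_fork, DEMO_MAX };

static const uint64_t demo_table[DEMO_MAX] = {
    [DEMO_exit]   = DEMO_ALWAYS,
    [DEMO_getuid] = DEMO_STDIO,
    [DEMO_open]   = DEMO_RPATH,
    [DEMO_fork]   = DEMO_PROC,
};

/*
 * `held` is the caller's promise mask. On EPERM, `*missing` receives the
 * promises that would have allowed the call; otherwise it is 0.
 */
static int
demo_check(uint64_t held, int code, uint64_t *missing)
{
    *missing = 0;

    if (code < 0 || code >= DEMO_MAX)       /* 1. range check */
        return EINVAL;
    if (demo_table[code] == DEMO_ALWAYS)    /* 2. always allowed */
        return 0;
    if (held & demo_table[code])            /* 3. any shared bit */
        return 0;

    *missing = demo_table[code];
    return EPERM;
}
```

In this sketch a caller holding only DEMO_STDIO may call DEMO_getuid, but DEMO_open comes back with EPERM and DEMO_RPATH in missing.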

But we skipped an important line:

p->p_pledge = READ_ONCE(p->p_p->ps_pledge); /* pledge checks are per-thread */

As this is kernel code, we have to pay a bit more attention. Let's dissect it a bit more carefully then:

p is our kernel thread, and it has a cache for the pledge mask
p->p_p is the owning (not parent!) process for the thread
p->p_p->ps_pledge is the process' pledge mask

We read it once and then cache it in the kernel thread. We do this because another thread could potentially alter the process' mask concurrently. That would be bad, because pledge_syscall is not the only place this mask gets checked.

There are more refined checks in the source, like pledge_namei or pledge_ioctl. They all first have to pass the pledge_syscall check and then can move on. Without that snapshot of the mask, we would check with a somehow inconsistent set of promises.
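The snapshot idea can be imitated in userland as well. This is only an approximation: the kernel uses READ_ONCE, while I am using a relaxed C11 atomic load as a stand-in, and all the demo_ types and functions are invented for this sketch.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Shared by all threads of the (hypothetical) process. */
struct demo_process {
    _Atomic uint64_t pledge;
};

struct demo_thread {
    struct demo_process *proc;  /* owning process */
    uint64_t pledge;            /* per-thread snapshot of the mask */
};

/* Take the snapshot once, at syscall entry. */
static void
demo_enter(struct demo_thread *t)
{
    t->pledge = atomic_load_explicit(&t->proc->pledge,
        memory_order_relaxed);
}

/* Later, finer-grained checks consult only the snapshot. */
static int
demo_has(const struct demo_thread *t, uint64_t bits)
{
    return (t->pledge & bits) != 0;
}
```

If another thread shrinks the process mask between demo_enter and a later demo_has, the later check still sees the mask the syscall started with, so all checks within one syscall agree with each other.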

We won't dive into the implementation of pledge_namei and the others, that would be too much for a blog post.

When promises are broken

So what happens when a process breaks one of its promises? That's where pledge_fail comes in. This is the last function we should take a look at before we wrap it up.

int
pledge_fail(struct proc *p, int error, uint64_t code)
{
    const char *codes = "";
    int i;

    /* Print first matching pledge */
    for (i = 0; code && pledgenames[i].bits != 0; i++)
        if (pledgenames[i].bits & code) {
            codes = pledgenames[i].name;
            break;
        }

First, we figure out which promise was violated. Remember tval from pledge_syscall? That's where we saved the bitmask of promises that would have been required (remember that long array mapping syscall to bitmasks in the beginning?). We loop over the pledgenames array to find the first name that matches, so we have something that we can print that is understandable and not just a weird looking bitmask.
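Here is the "first matching name" loop as a standalone sketch. The table is a made-up subset with made-up bit values, terminated by an entry whose bits are 0 like the loop condition above expects, and demo_first_name is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_name { uint64_t bits; const char *name; };

/* Hypothetical subset; the entry with bits == 0 ends the table. */
static const struct demo_name demo_names[] = {
    { UINT64_C(1) << 0, "stdio" },
    { UINT64_C(1) << 1, "rpath" },
    { UINT64_C(1) << 2, "wpath" },
    { 0, NULL },
};

/* Return the name of the first table entry that overlaps `code`. */
static const char *
demo_first_name(uint64_t code)
{
    int i;

    for (i = 0; code && demo_names[i].bits != 0; i++)
        if (demo_names[i].bits & code)
            return demo_names[i].name;
    return "";
}
```

Note that only one name comes out, even if the syscall could have been allowed by several promises. In this sketch a mask with both the rpath and wpath bits set is reported as "rpath", simply because that entry comes first in the table.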

#ifdef KTRACE
    if (KTRPOINT(p, KTR_PLEDGE))
        ktrpledge(p, error, code, p->p_pledge_syscall);
#endif
    if (p->p_pledge & PLEDGE_ERROR)
        return (ENOSYS);

If ktrace is active, we log the violation for debugging.

Then, the error promise gets its moment. If the process pledged "error", we just return ENOSYS and let the process live. No drama, it shall live for another day (again, man pledge is your best friend here). The process can handle the error and move on.

If "error" is not set though, things get serious:

KERNEL_LOCK();
uprintf("%s[%d]: pledge \"%s\", syscall %d\n",
    p->p_p->ps_comm, p->p_p->ps_pid, codes, p->p_pledge_syscall);
p->p_p->ps_acflag |= APLEDGE;

First we print a diagnostic to the process' controlling terminal. If you ever did something that you didn't pledge in your OpenBSD code, the system will stop you hard in your tracks.

Quick sideshow:

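#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
    /* From here on, only "stdio" syscalls are allowed. */
    pledge("stdio", NULL);
    /* Opening a file needs "rpath", which we did not pledge. */
    open("/etc/passwd", O_RDONLY);
    return 0;
}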

rootnode /tmp $ cc fail.c -o fail
rootnode /tmp $ ./fail
fail[67014]: pledge "rpath", syscall 5
Abort trap (core dumped)
(134) rootnode /tmp $

Very helpful for figuring out what you forgot to promise, but the kernel does very quickly shoot you in the face instead of allowing anything that you didn't promise.

And if accounting is enabled, we can even see our failed test program in the output of lastcomm:

ksh[78314]                            -F      rootnode                         ttyp1      0.00 secs Fri Feb 13 01:03 (0:00:00.00)
printf[51819]                         -       rootnode                         ttyp1      0.00 secs Fri Feb 13 01:03 (0:00:00.00)
fail[32547]                           -DXP    rootnode                         ttyp1      0.00 secs Fri Feb 13 01:03 (0:00:00.00)
ksh[26780]                            -F      rootnode                         ttyp1      0.00 secs Fri Feb 13 01:03 (0:00:00.02)

That is thanks to that accounting flag mentioned above and confirms what we read in the man page.

Ok, back to the rest:

I guess ps_acflag holds some accounting flags, just to keep track of some info about processes. Reading the man page, this should be for this case:

    a process that was terminated due to a pledge violation is accounted by lastcomm(1) with the `P' flag.

So setting the accounting flag APLEDGE is what marks the process for lastcomm.

    /* Try to stop threads immediately, because this process is suspect */
    if (P_HASSIBLING(p))
        single_thread_set(p, SINGLE_UNWIND | SINGLE_DEEP);

    /* Send uncatchable SIGABRT for coredump */
    sigabort(p);

    p->p_p->ps_pledge = 0;      /* Disable all PLEDGE_ flags */
    KERNEL_UNLOCK();
    return (error);
}

If the process has other threads, we try to stop them immediately. The process is suspect at this point, as the comment says. No point in letting sibling threads keep running while the kernel is about to shoot the entire process in the face.

Finally, the gun goes off with sigabort and we clear the pledge mask, which, as the comment says, disables all PLEDGE_ flags for the doomed process.

Conclusion

For such an extremely powerful tool, the implementation is astonishingly small and lightweight (compared to a lot of other kernel code).

It's not just that pledge is easy to use, but its implementation is also easy. This should honestly be implemented in this simplicity everywhere.

I know that other systems have similar tools (seccomp or AppArmor on Linux, Capsicum on FreeBSD). But seccomp has you write a BPF filter program just to describe which syscalls are allowed…*WHY!?*. And Capsicum, although nice, requires a bit more restructuring of existing code, as it is file descriptor based and the descriptors all have to be set up before calling cap_enter. It works fabulously if integrated into the code from the start, but it is difficult to add to existing code. (At some point in the future, we might actually go take a look at Capsicum, but I fear the code for that might be a bit longer.)

pledge on the other hand…very easy to throw into existing code. Your code is only supposed to read files and maybe open a socket? Cool, pledge that and keep "exec" out of it, and it's already a lot more difficult to exploit your program and execute a shell.

That concludes today's code reading. For the next one, I think it makes a lot of sense if we take a look at pledge's partner in crime: unveil.

And as usual: always happy to hear any feedback or suggestions over on Mastodon.