Daily Source Reading: `unveil`

`unveil`

As promised at the end of yesterday's article, today we are going to take a look at unveil, the spiritual sibling of pledge.

Uncloaking the files

int
sys_unveil(struct proc *p, void *v, register_t *retval)
{
    struct sys_unveil_args /* {
        syscallarg(const char *) path;
        syscallarg(const char *) permissions;
    } */ *uap = v;
    struct process *pr = p->p_p;
    char *pathname, *c;
    struct nameidata nd;
    size_t pathlen;
    char permissions[5];
    int error, allow;

A typical start for a syscall, forcing a void pointer into the arg structure and declaring all the variables we need.

if (SCARG(uap, path) == NULL && SCARG(uap, permissions) == NULL) {
    pr->ps_uvdone = 1;
    return (0);
}

if (pr->ps_uvdone != 0)
    return EPERM;

Should both arguments to unveil be NULL, it means we are done calling unveil. So we set the process's unveil-done flag and return. A similar effect could be achieved by calling pledge without the "unveil" promise.

If the flag is already set, any further unveil call returns a permission error (EPERM).
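To see how the "done" flag behaves, here is a minimal user-space sketch of just that state machine. The names `fake_process` and `fake_unveil` are made up for illustration, and the EFAULT branch is my assumption about what a single NULL argument would lead to; the real syscall does far more than this.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct process. */
struct fake_process {
    int uvdone;  /* mirrors ps_uvdone: nonzero once unveil is locked */
    int ncalls;  /* number of accepted unveil requests */
};

/* Returns 0 on success or an errno value, like the kernel function. */
int
fake_unveil(struct fake_process *pr, const char *path, const char *permissions)
{
    assert(pr != NULL);

    /* Both arguments NULL: lock the unveil state. */
    if (path == NULL && permissions == NULL) {
        pr->uvdone = 1;
        return 0;
    }

    /* Once locked, every further request is refused. */
    if (pr->uvdone != 0)
        return EPERM;

    /* Assumption: a single NULL argument fails when it is copied in. */
    if (path == NULL || permissions == NULL)
        return EFAULT;

    pr->ncalls++;
    return 0;
}
```

Note that locking twice is harmless in this sketch, just as in the excerpt above: the NULL/NULL check comes before the EPERM check.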

error = copyinstr(SCARG(uap, permissions), permissions,
    sizeof(permissions), NULL);
if (error)
    return (error);

Yesterday I simply hopped over copyinstr. Today I finally looked it up: it copies a NUL-terminated string from user address space into kernel space. That makes sense, as we want to read the string arguments to unveil and we are in kernel land.
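As a rough mental model, copyinstr behaves like a bounded string copy that also reports how many bytes it copied. The sketch below is a plain user-space imitation with a made-up name (`sketch_copystr`); it does none of the user/kernel address checking that the real function exists for.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Copy the NUL-terminated string src into dst, writing at most len bytes
 * including the terminator. If done is not NULL, store the number of
 * bytes copied (terminator included on success) in *done.
 */
int
sketch_copystr(const char *src, char *dst, size_t len, size_t *done)
{
    size_t i;

    assert(src != NULL && dst != NULL);

    for (i = 0; i < len; i++) {
        dst[i] = src[i];
        if (src[i] == '\0') {
            if (done != NULL)
                *done = i + 1;
            return 0;
        }
    }

    /* Ran out of room before reaching the terminator. */
    if (done != NULL)
        *done = i;
    return ENAMETOOLONG;
}
```

With a 5-byte buffer like `permissions`, this accepts at most four characters plus the terminator. Because the reported length includes the terminator, an empty string has length 1, which is presumably why the syscall later rejects `pathlen < 2`.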

/*
 * System calls in other threads may sleep between unveil
 * datastructure inspections -- this is the simplest way to
 * provide consistency 
 */
single_thread_set(p, SINGLE_UNWIND);

pathname = pool_get(&namei_pool, PR_WAITOK);
error = copyinstr(SCARG(uap, path), pathname, MAXPATHLEN, &pathlen);
if (error)
    goto end;

The comment explains the first part. Pause other threads in this process, just to be safe.

Then we get some memory for the pathname from the namei pool (a resource pool for, have a guess... pathnames, that's right). Once we have our memory, we copy in the pathname argument from user space.

#ifdef KTRACE
    if (KTRPOINT(p, KTR_STRUCT))
        ktrstruct(p, "unveil", permissions, strlen(permissions));
#endif

Let's hop over the diagnostics here.

if (pathlen < 2) {
    error = EINVAL;
    goto end;
}

/* find root "/" or "//" */
for (c = pathname; *c != '\0'; c++) {
    if (*c != '/')
        break;
}
if (*c == '\0')
    /* root directory */
    NDINIT(&nd, LOOKUP, FOLLOW | LOCKLEAF | SAVENAME,
        UIO_SYSSPACE, pathname, p);
else
    NDINIT(&nd, CREATE, FOLLOW | LOCKLEAF | LOCKPARENT | SAVENAME,
        UIO_SYSSPACE, pathname, p);

This is a quick check whether the path is the root directory, meaning it consists of nothing but slashes. Non-root paths get the CREATE operation, because the path we are unveiling might not exist yet.
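The slash-scanning loop is easy to try out in isolation. Here it is as a tiny stand-alone function; the name `sketch_is_root` is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return 1 if path consists only of '/' characters, else 0.
 * An empty string also returns 1 here; the syscall never gets that far
 * because it rejects paths with pathlen < 2 beforehand.
 */
int
sketch_is_root(const char *path)
{
    const char *c;

    assert(path != NULL);

    /* Skip over every leading slash. */
    for (c = path; *c != '\0'; c++) {
        if (*c != '/')
            break;
    }

    /* If we reached the terminator, there was nothing but slashes. */
    return *c == '\0';
}
```

So "/" and "//" count as root, while "/etc" stops at the 'e' and takes the CREATE branch.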

nd.ni_pledge = PLEDGE_UNVEIL;
if ((error = namei(&nd)) != 0)
    goto end;

Now that we have our namei data, we go ahead and use namei to resolve the path into a vnode.

(Interlude: I have to point out here how magnificent the man pages are. Instead of having to go and find the function in the source to understand what it does, I literally just type man namei and get a detailed explanation. This is what makes reading the BSD code so fun, especially OpenBSD.)

/*
 * XXX Any access to the file or directory will allow us to
 * pledge path it
*/
allow = ((nd.ni_vp &&
    (VOP_ACCESS(nd.ni_vp, VREAD, p->p_ucred, p) == 0 ||
    VOP_ACCESS(nd.ni_vp, VWRITE, p->p_ucred, p) == 0 ||
    VOP_ACCESS(nd.ni_vp, VEXEC, p->p_ucred, p) == 0)) ||
    (nd.ni_dvp &&
    (VOP_ACCESS(nd.ni_dvp, VREAD, p->p_ucred, p) == 0 ||
    VOP_ACCESS(nd.ni_dvp, VWRITE, p->p_ucred, p) == 0 ||
    VOP_ACCESS(nd.ni_dvp, VEXEC, p->p_ucred, p) == 0)));

What a nice multi-line boolean expression... but what it boils down to is that we check whether the process has access to the vnode or to its parent directory. Any kind of access will do. Say we unveil a path for reading, but it later turns out that the process does not have read permission on it and can only execute it: that is not unveil's problem to deal with.

unveil only concerns itself with whether we can access that vnode in any form or fashion.
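The same "any access at all" idea can be imitated from user space with access(2). This is only an analogy: the kernel asks VOP_ACCESS with the process credentials, while access(2) checks against the real user ID. The function names are made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Return 1 if the caller has read, write, or execute access to path. */
int
sketch_any_access(const char *path)
{
    assert(path != NULL);

    return access(path, R_OK) == 0 ||
        access(path, W_OK) == 0 ||
        access(path, X_OK) == 0;
}

/*
 * Mirror the shape of the kernel expression: allow if either the node
 * itself or its parent directory is accessible in any way. A NULL
 * argument stands for a missing vnode.
 */
int
sketch_allow(const char *path, const char *parent)
{
    return (path != NULL && sketch_any_access(path)) ||
        (parent != NULL && sketch_any_access(parent));
}
```

A file that does not exist yet fails the first half, but still passes if its parent directory is accessible, which matches the CREATE case from earlier.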

/* release lock from namei, but keep ref */
if (nd.ni_vp)
    VOP_UNLOCK(nd.ni_vp);
if (nd.ni_dvp && nd.ni_dvp != nd.ni_vp)
    VOP_UNLOCK(nd.ni_dvp);

if (allow)
    error = unveil_add(p, &nd, permissions);
else
    error = EPERM;

The namei call earlier returned the vnodes locked, because we passed LOCKLEAF (and LOCKPARENT for non-root paths) to NDINIT. So we unlock them (not release them), but as the comment says, we keep our references for now.

We hold on to them because we are not done with our unveil call and don't want the vnodes to go away yet.

If our earlier check (whether access is allowed) was successful, we finally pass our path to unveil_add (just wait a moment, we'll get to that one).

    /* release vref from namei, but not vref from unveil_add */
    if (nd.ni_vp)
        vrele(nd.ni_vp);
    if (nd.ni_dvp)
        vrele(nd.ni_dvp);

    pool_put(&namei_pool, nd.ni_cnd.cn_pnbuf);
end:
    pool_put(&namei_pool, pathname);

    single_thread_clear(p);
    return (error);
}

We are done. We can now release our references to the vnodes, as unveil_add holds on to its own. And because we are good citizens, we also give back the memory we allocated from the pool.
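The `goto end` pattern used throughout the syscall is the classic C way to get a single cleanup path. Here is a minimal user-space sketch of the shape, with malloc/free standing in for pool_get/pool_put; the function name and its checks are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy input into a temporary buffer of maxlen bytes, validate it, and
 * report its length. Every exit after the allocation goes through "end".
 */
int
sketch_with_cleanup(const char *input, size_t maxlen, size_t *outlen)
{
    char *buf;
    int error = 0;

    assert(input != NULL && outlen != NULL);

    buf = malloc(maxlen);
    if (buf == NULL)
        return ENOMEM;

    if (strlen(input) >= maxlen) {
        error = ENAMETOOLONG;
        goto end;
    }
    strcpy(buf, input);

    if (buf[0] == '\0') {
        error = EINVAL;
        goto end;
    }

    *outlen = strlen(buf);
end:
    /* Reached on success and on every error after the allocation. */
    free(buf);
    return error;
}
```

The benefit is that adding another early exit never risks forgetting the cleanup, which matters a lot more in a kernel than in this toy.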

Behind the veil, ehm, curtain

Alright, strap in, this is a bit of a long one. Grab some water, some rations, and off we go.

This is what our syscall that we just looked at does at the end. It takes the namei data (so, the path basically) and the permissions we wanted and adds it to the process' unveil data.

int
unveil_add(struct proc *p, struct nameidata *ndp, const char *permissions)
{
    struct process *pr = p->p_p;
    struct vnode *vp;
    struct unveil *uv;
    int directory_add;
    int ret = EINVAL;
    u_char flags;

Booooooring boilerplate, … skip (but a good reference to have while reading).

KASSERT(ISSET(ndp->ni_cnd.cn_flags, HASBUF)); /* must have SAVENAME */

if (unveil_parsepermissions(permissions, &flags) == -1)
    goto done;

A quick check that the SAVENAME flag we passed to NDINIT earlier really took effect. We need to be sure that the pathname is stored in the namei data.

Then we parse the permissions (in the current version, a string with any of the characters "rwxc"). This is straightforward, so we are not even going to take a look, but if you are interested, you absolutely should. One thing we have learned in this series so far is that even boring functions can end up being interesting.
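For the curious, parsing such a permission string could look roughly like the sketch below. The flag names and values (`SK_READ` and friends) are placeholders invented for this example, not the kernel's definitions, and the real unveil_parsepermissions may differ in its details.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch only. */
#define SK_READ   0x01
#define SK_WRITE  0x02
#define SK_CREATE 0x04
#define SK_EXEC   0x08

/*
 * Translate a string made of 'r', 'w', 'x' and 'c' into a bitmask.
 * Returns 0 on success, -1 on any unknown character.
 */
int
sketch_parsepermissions(const char *permissions, unsigned char *flags)
{
    const char *cp;
    unsigned char f = 0;

    assert(permissions != NULL && flags != NULL);

    for (cp = permissions; *cp != '\0'; cp++) {
        switch (*cp) {
        case 'r':
            f |= SK_READ;
            break;
        case 'w':
            f |= SK_WRITE;
            break;
        case 'x':
            f |= SK_EXEC;
            break;
        case 'c':
            f |= SK_CREATE;
            break;
        default:
            return -1;
        }
    }

    /* Only touch the output once the whole string is known to be valid. */
    *flags = f;
    return 0;
}
```

In this sketch an empty string is valid and yields no permissions at all, and repeated characters are harmless.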

if (pr->ps_uvpaths == NULL) {
    pr->ps_uvpaths = mallocarray(UNVEIL_MAX_VNODES,
        sizeof(struct unveil), M_PROC, M_WAITOK|M_ZERO);
}

if (pr->ps_uvvcount >= UNVEIL_MAX_VNODES ||
    pr->ps_uvncount >= UNVEIL_MAX_NAMES) {
    ret = E2BIG;
    goto done;
}

Right…so ps_uvpaths seems to be where we store all the unveiled paths. If we never call unveil, there's no reason to carry that data. But once we do, we allocate as many as allowed (UNVEIL_MAX_VNODES is 128 at the time of this writing).

Some additional error checking to make sure we are not overshooting the size and we are good to go.
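The "allocate lazily, then check the limit" order can be sketched in user space like this. The limit of 4, the struct names and `sk_reserve` are all made up to keep the example small, and calloc can fail where the kernel's M_WAITOK allocation would simply wait.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define SK_MAX_VNODES 4  /* hypothetical small limit for the sketch */

struct sk_entry {
    int used;
};

struct sk_process {
    struct sk_entry *paths;  /* NULL until the first reservation */
    size_t count;            /* number of slots handed out so far */
};

/* Reserve the next slot; returns 0 or an errno value. */
int
sk_reserve(struct sk_process *pr)
{
    assert(pr != NULL);

    /* First use: allocate the whole zeroed table in one go. */
    if (pr->paths == NULL) {
        pr->paths = calloc(SK_MAX_VNODES, sizeof(struct sk_entry));
        if (pr->paths == NULL)
            return ENOMEM;
    }

    /* The table never grows, so a full table is a hard error. */
    if (pr->count >= SK_MAX_VNODES)
        return E2BIG;

    pr->paths[pr->count++].used = 1;
    return 0;
}
```

A process that never asks for a slot never pays for the table, and one that does pays once for the fixed maximum.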

That was the warm-up and everything is prepared now. Finally time for the good stuff.

/* Are we a directory? or something else */
directory_add = ndp->ni_vp != NULL && ndp->ni_vp->v_type == VDIR;

if (directory_add)
    vp = ndp->ni_vp;
else
    vp = ndp->ni_dvp;

KASSERT(vp->v_type == VDIR);
vref(vp);
vp->v_uvcount++;

The comment here helps. We check if we are unveiling a directory or a terminal node (say /etc/pf.conf instead of /etc).

If it is a directory, we grab ni_vp, which is the direct vnode for that directory. If it is a terminal node, we grab the node for its parent directory and keep the terminal name alongside it (so the node for /etc and "pf.conf" as a name).

Then we hold a reference to it to make sure it doesn't just vanish and we increase the number of unveils for the vnode.

if ((uv = unveil_lookup(vp, pr, NULL)) != NULL) {
    /*
     * We already have unveiled this directory
     * vnode
     */
    vp->v_uvcount--;
    vrele(vp);

If we already have an unveil entry for that node, we drop the extra reference we just took.

But, there are some special cases to consider, even if we already have that node.

/*
 * If we are adding a directory which was already
 * unveiled containing only specific terminals,
 * unrestrict it.
 */
if (directory_add) {
    DPRINTF("unveil: %s(%d): updating directory vnode %p"
        " to unrestricted uvcount %d\n",
        pr->ps_comm, pr->ps_pid, vp, vp->v_uvcount);

    if (!unveil_setflags(&uv->uv_flags, flags))
        ret = EPERM;
    else
        ret = 0;
    goto done;
}

Say we already unveiled /etc/pf.conf before. That means we already have a vnode for /etc with the "pf.conf" terminal. But if we then unveil /etc itself, it's the same node, but with ALL of the directory.

So we need to widen the scope of the unveiled area.

The other case:

/*
 * If we are adding a terminal that is already unveiled, just
 * replace the flags and we are done
 */
if (!directory_add) {
    struct unvname *tname;
    if ((tname = unveil_namelookup(uv,
        ndp->ni_cnd.cn_nameptr)) != NULL) {
        DPRINTF("unveil: %s(%d): changing flags for %s"
            "in vnode %p, uvcount %d\n",
            pr->ps_comm, pr->ps_pid, tname->un_name, vp,
            vp->v_uvcount);

        if (!unveil_setflags(&tname->un_flags, flags))
            ret = EPERM;
        else
            ret = 0;
        goto done;
    }
}

Say that after our previous call we unveil the same terminal again, this time adding "w" to its permissions. That is similar to pledge: widening permissions is not allowed.
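A "no widening" rule like the one described here is a one-line bit test. The sketch below shows one way to write it, assuming the rule as stated in this article; the real unveil_setflags may implement it differently, and `sketch_setflags` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Replace *flags with nflags unless nflags contains a bit that *flags
 * does not have. Returns 1 if the flags were replaced, 0 if the request
 * was refused as an escalation.
 */
int
sketch_setflags(unsigned char *flags, unsigned char nflags)
{
    assert(flags != NULL);

    /* Any bit set in nflags but clear in *flags would widen access. */
    if ((~(*flags) & nflags) != 0)
        return 0;

    *flags = nflags;
    return 1;
}
```

Narrowing (dropping bits) and repeating the same flags both succeed; only adding a new bit is refused, which the caller in this sketch would turn into EPERM.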

} else {
    /*
     * New unveil involving this directory vnode.
     */
    uv = unveil_add_vnode(p, vp);
}

If we haven't seen that vnode yet, we simply add it with unveil_add_vnode. It's not long, so we will add that to our pile.

    /*
     * At this stage with have a unveil in uv with a vnode for a
     * directory. If the component we are adding is a directory,
     * we are done. Otherwise, we add the component name the name
     * list in uv.
     */

    if (directory_add) {
        uv->uv_flags = flags;
        ret = 0;

        DPRINTF("unveil: %s(%d): added unrestricted directory vnode %p"
            ", uvcount %d\n",
            pr->ps_comm, pr->ps_pid, vp, vp->v_uvcount);
        goto done;
    }

    if (unveil_add_name(uv, ndp->ni_cnd.cn_nameptr, flags))
        pr->ps_uvncount++;
    ret = 0;

    DPRINTF("unveil: %s(%d): added name %s beneath %s vnode %p,"
        " uvcount %d\n",
        pr->ps_comm, pr->ps_pid, ndp->ni_cnd.cn_nameptr,
        uv->uv_flags ? "unrestricted" : "restricted",
        vp, vp->v_uvcount);

 done:
    return ret;
}

I think the comment here explains it more concisely than I could, so let's move on.

The little helpers

struct unveil *
unveil_add_vnode(struct proc *p, struct vnode *vp)
{
    struct process *pr = p->p_p;
    struct unveil *uv = NULL;
    ssize_t i;

    KASSERT(pr->ps_uvvcount < UNVEIL_MAX_VNODES);

    uv = &pr->ps_uvpaths[pr->ps_uvvcount++];
    rw_init(&uv->uv_lock, "unveil");
    RBT_INIT(unvname_rbt, &uv->uv_names);
    uv->uv_vp = vp;
    uv->uv_flags = 0;

Grab the next free unveil slot and initialize its read-write lock. Then we initialize a red-black tree for the terminal names.

    /* find out what we are covered by */
    uv->uv_cover = unveil_find_cover(vp, p);

    /*
     * Find anyone covered by what we are covered by
     * and re-check what covers them (we could have
     * interposed a cover)
     */
    for (i = 0; i < pr->ps_uvvcount - 1; i++) {
        if (pr->ps_uvpaths[i].uv_cover == uv->uv_cover)
            pr->ps_uvpaths[i].uv_cover =
                unveil_find_cover(pr->ps_uvpaths[i].uv_vp, p);
    }

    return (uv);
}

Right, from what I can gather about uv_cover, it works like this:

After unveil("/foo", "rw"), /foo is covered.
Any access to, say, /foo/bar/something has to walk up the tree to check whether it hits something that covers it. In this case, that is /foo.
When we then do unveil("/foo/bar", "r"), we have to add /foo/bar in between on that path, so now /foo/bar/something is covered by /foo/bar.
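To make the cover idea concrete, here is a small model that uses path strings instead of vnodes. Everything in it (`sk_table`, `sk_find_cover`, `sk_add`, the fixed sizes) is invented for this sketch; the kernel walks vnodes, not strings, and I have not shown unveil_find_cover itself.

```c
#include <assert.h>
#include <string.h>

#define SK_MAX     8
#define SK_PATHLEN 64

struct sk_unveil {
    char path[SK_PATHLEN];
    int cover;  /* index of the nearest unveiled ancestor, -1 if none */
};

struct sk_table {
    struct sk_unveil uv[SK_MAX];
    int count;
};

/* Index of the entry whose path equals path exactly, or -1. */
static int
sk_lookup(const struct sk_table *t, const char *path)
{
    int i;

    for (i = 0; i < t->count; i++)
        if (strcmp(t->uv[i].path, path) == 0)
            return i;
    return -1;
}

/* Walk up one component at a time; return the first unveiled ancestor. */
int
sk_find_cover(const struct sk_table *t, const char *path)
{
    char buf[SK_PATHLEN];
    char *slash;
    int i;

    assert(strlen(path) < SK_PATHLEN);
    strcpy(buf, path);

    while ((slash = strrchr(buf, '/')) != NULL) {
        if (slash == buf) {
            if (buf[1] == '\0')
                return -1;      /* "/" has no parent */
            buf[1] = '\0';      /* the parent is "/" */
        } else {
            *slash = '\0';      /* cut off the last component */
        }
        if ((i = sk_lookup(t, buf)) != -1)
            return i;
    }
    return -1;
}

/* Add path and re-check entries that shared its cover. */
int
sk_add(struct sk_table *t, const char *path)
{
    int i, idx;

    assert(t->count < SK_MAX);
    assert(strlen(path) < SK_PATHLEN);

    idx = t->count++;
    strcpy(t->uv[idx].path, path);
    t->uv[idx].cover = sk_find_cover(t, path);

    /* We may have been interposed between an entry and its old cover. */
    for (i = 0; i < t->count - 1; i++) {
        if (t->uv[i].cover == t->uv[idx].cover)
            t->uv[i].cover = sk_find_cover(t, t->uv[i].path);
    }
    return idx;
}
```

Adding /foo, then /foo/bar/baz, then /foo/bar shows the interposition: /foo/bar/baz is first covered by /foo, and after the third call its cover is re-checked and becomes /foo/bar.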

Conclusion

This one was a bit harder to get through. There are more components and moving parts involved. But again, it mostly seems to be a simple, yet elegant, solution.

But wait…what's missing here? When do we actually check if a file access is allowed? Well, that we will look at this weekend. One of the involved functions is a bit longer, so we will split that into the next post.

In case I have gotten something egregiously wrong, feel free to yell at me over on Mastodon. Any other feedback also welcome.