Daily Source Reading: `namei`

`namei`

Yesterday we looked at unveil, but that was only one half of the picture. We learned how it uses vnodes instead of paths internally, how it handles coverage, and how it tracks the unveil status.

But that's not the full story yet, is it? With pledge, we saw how it manages things and how it enforces them. So we need to do the same thing for unveil.

With pledge, that happened in pledge_syscall. With unveil, this appears to happen in namei, the function that resolves pathnames to vnodes. Any call that tries to open, create or execute a file has to go through this function.

So without further ado, let's dive in. But strap in, because this is going to be one long ride. There's two functions we want to take a look at. And at the same time, they won't just cover unveil, they will also cover some of the most used filesystem functions.

On one hand, I would want to only concentrate on the unveil specific parts, but on the other hand, while we are here, why not explore all of its glory?

It might be a lot, so feel free to skip parts you don't wanna read. I will sprinkle sub-headings in here, called unveil related. If you are only interested in the unveil parts, search for those. But if I already have to go and read and dissect the code, I will do it in one shot. Peruse at your own speed and pleasure. I am sure we will be revisiting parts at some point.

Enough of that, here we go:

What's in a name(i)

So, as a user, one might write:

f = fopen("/etc/pf.conf", "r");

But have you ever wondered what happens underneath this? Sure, "/etc/pf.conf" is a path. But that might also be a symlink, right?

So, with symlinks, and symlink to symlinks, how do we actually access files? Like, proper files?

The answer is called a vnode. That is a unique "this file" identifier of sorts. We will get into the absolute details of this later on.

All we need to know for now is that OpenBSD has a function called namei that converts from a pathname to a vnode.

And as a quick exercise, for new and old readers…how would you wanna learn about namei? These blog posts are not 100% free. So before we dive into the code, I at least expect you to man namei if you are on an OpenBSD system or to read the man page.

Done? Good. So here we go:

/*
 * Convert a pathname into a pointer to a vnode.
 *
 * The FOLLOW flag is set when symbolic links are to be followed
 * when they occur at the end of the name translation process.
 * Symbolic links are always followed for all other pathname
 * components other than the last.
 *
 * If the LOCKLEAF flag is set, a locked vnode is returned.
 *
 * The segflg defines whether the name is to be copied from user
 * space or kernel space.
 *
 * Overall outline of namei:
 *
 *  copy in name
 *  get starting directory
 *  while (!done && !error) {
 *      call lookup to search path.
 *      if symbolic link, massage name in buffer and continue
 *  }
 */
int
namei(struct nameidata *ndp)
{
    struct filedesc *fdp;       /* pointer to file descriptor state */
    char *cp;           /* pointer into pathname argument */
    struct vnode *dp;       /* the directory we are searching */
    struct iovec aiov;      /* uio for reading symbolic links */
    struct uio auio;
    int error, linklen;
    struct componentname *cnp = &ndp->ni_cnd;
    struct proc *p = cnp->cn_proc;

    ndp->ni_cnd.cn_cred = ndp->ni_cnd.cn_proc->p_ucred;
#ifdef DIAGNOSTIC
    if (!cnp->cn_cred || !cnp->cn_proc)
        panic ("namei: bad cred/proc");
    if (cnp->cn_nameiop & (~OPMASK))
        panic ("namei: nameiop contaminated with flags");
    if (cnp->cn_flags & OPMASK)
        panic ("namei: flags contaminated with nameiops");
#endif
    fdp = cnp->cn_proc->p_fd;

Our usual start. Declare needed variables, do some diagnostic checks, the mundane stuff.

/*
 * Get a buffer for the name to be translated, and copy the
 * name into the buffer.
 */
if ((cnp->cn_flags & HASBUF) == 0)
    cnp->cn_pnbuf = pool_get(&namei_pool, PR_WAITOK);
if (ndp->ni_segflg == UIO_SYSSPACE) {
    ndp->ni_pathlen = strlcpy(cnp->cn_pnbuf, ndp->ni_dirp,
        MAXPATHLEN);
    if (ndp->ni_pathlen >= MAXPATHLEN) {
        error = ENAMETOOLONG;
    } else {
        error = 0;
        ndp->ni_pathlen++;  /* ni_pathlen includes NUL */
    }
} else
    error = copyinstr(ndp->ni_dirp, cnp->cn_pnbuf,
            MAXPATHLEN, &ndp->ni_pathlen);

There's that pool again. It might be good to take a look at that implementation down the road, so let me know if it's something you wanna see.

If the name is already coming from kernel space (UIO_SYSSPACE), then we just strlcpy it and check whether it got truncated. Otherwise we use the copyinstr we saw yesterday to copy from userspace to kernel space.
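
The kernel-space branch boils down to "copy, then check if it fit". Here is a minimal userland sketch of that idea. The names sketch_copy_path and SKETCH_MAXPATHLEN are my own stand-ins, and I use strlen plus memcpy instead of strlcpy so it also builds on systems without strlcpy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAXPATHLEN 1024  /* hypothetical stand-in for MAXPATHLEN */

/*
 * Hypothetical helper: copy src into buf, which must hold
 * SKETCH_MAXPATHLEN bytes. On success, *lenp is the length
 * including the terminating NUL (like ni_pathlen) and 0 is returned.
 * If the name does not fit, ENAMETOOLONG is returned.
 */
int
sketch_copy_path(char *buf, const char *src, size_t *lenp)
{
    size_t len;

    assert(buf != NULL && src != NULL && lenp != NULL);
    len = strlen(src);
    if (len >= SKETCH_MAXPATHLEN)
        return ENAMETOOLONG;
    memcpy(buf, src, len + 1);  /* copy the NUL as well */
    *lenp = len + 1;            /* count the NUL, as namei does */
    return 0;
}
```

For "/etc/pf.conf" the reported length is 13: twelve characters plus the NUL.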

    /*
     * Fail on null pathnames
     */
    if (error == 0 && ndp->ni_pathlen == 1)
        error = ENOENT;

    if (error)
        goto fail;

#ifdef KTRACE
    if (KTRPOINT(cnp->cn_proc, KTR_NAMEI))
        ktrnamei(cnp->cn_proc, cnp->cn_pnbuf);
#endif

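Mostly self-explanatory with the comment. A ni_pathlen of 1 means the buffer holds nothing but the terminating NUL, so the path is empty and we fail with ENOENT. The KTRACE block just records the name for ktrace. Moving on.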

/*
 *  Strip trailing slashes, as requested
 */
if (cnp->cn_flags & STRIPSLASHES) {
    char *end = cnp->cn_pnbuf + ndp->ni_pathlen - 2;

    cp = end;
    while (cp >= cnp->cn_pnbuf && (*cp == '/'))
        cp--;

    /* Still some remaining characters in the buffer */
    if (cp >= cnp->cn_pnbuf) {
        ndp->ni_pathlen -= (end - cp);
        *(cp + 1) = '\0';
    }
}

Same here. Nothing magic, just proper sanitization, which is also important.
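
If you want to play with the slash stripping outside the kernel, here is a rough userland sketch. The name sketch_strip_slashes is made up, and I walk the buffer with an index instead of a pointer so the sketch never steps in front of the buffer. The behavior is meant to mirror the snippet above, including leaving a path that consists only of slashes untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: strip trailing slashes from buf in place.
 * pathlen counts the terminating NUL, like ni_pathlen.
 * Returns the new length, again including the NUL.
 */
size_t
sketch_strip_slashes(char *buf, size_t pathlen)
{
    size_t last, i;

    /* Empty paths were already rejected earlier, so pathlen >= 2. */
    assert(buf != NULL && pathlen >= 2 && pathlen == strlen(buf) + 1);

    last = pathlen - 2;     /* index of the last character before the NUL */
    i = last;
    while (i > 0 && buf[i] == '/')
        i--;

    /* Nothing but slashes: leave the buffer alone. */
    if (buf[i] == '/')
        return pathlen;

    buf[i + 1] = '\0';
    return pathlen - (last - i);
}
```

With this, "/usr/lib///" becomes "/usr/lib", while "///" stays as it is.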

/*
 * Get starting point for the translation.
 */
if ((ndp->ni_rootdir = fdp->fd_rdir) == NULL ||
    (ndp->ni_cnd.cn_flags & KERNELPATH))
    ndp->ni_rootdir = rootvnode;

Before we can actually resolve anything, we need to know where we are starting from. The root directory for the process is grabbed from the file descriptor table. If the process doesn't have one (or KERNELPATH is set), we fall back to the system's rootvnode.

unveil related

And if we really pay attention, we will see our first unveil usage in this code.

if (ndp->ni_cnd.cn_flags & KERNELPATH) {
    ndp->ni_cnd.cn_flags |= BYPASSUNVEIL;
} else {
    error = pledge_namei(p, ndp, cnp->cn_pnbuf);
    if (error)
        goto fail;
}

See it? There it is. If the path is from kernel space, then we can bypass unveil. For that, we set the flag here for later.

Why is that? The kernel is already trusted 100%. Wouldn't make much sense to restrict it.

Otherwise, the lookup comes from a userland process. In that case, we call pledge_namei. Yes, that is where pledge and unveil work together. Depending on its promises, not every process is allowed to access paths through namei, so that check happens here.

Aight, so next up we will stop talking the talk and start walking the walk.

/*
 * Check if starting from root directory or current directory.
 */
if (cnp->cn_pnbuf[0] == '/') {
    dp = ndp->ni_rootdir;
    vref(dp);
    if (cnp->cn_flags & REALPATH && cnp->cn_rpi == 0) {
        cnp->cn_rpbuf[0] = '/';
        cnp->cn_rpbuf[1] = '\0';
        cnp->cn_rpi = 1;
    }

Case 1: the path starts with a /, so it's an absolute path. We also grab a reference to the vnode with vref, so it doesn't vanish.

The realpath stuff is probably for realpath(3) support, but I haven't fully read into that, so we skip over it. It's not doing anything unveil related anyway.

} else if (ndp->ni_dirfd == AT_FDCWD) {
    dp = fdp->fd_cdir;
    vref(dp);
    unveil_start_relative(p, ndp, dp);
    unveil_check_component(p, ndp, dp);

Case 2: Otherwise it's a relative path. If ni_dirfd is AT_FDCWD, no explicit directory file descriptor was given, so we start from the process's current directory (we'll see soon why that distinction matters).

unveil related

But wait, there's two unveil calls! unveil_start_relative sets up the state machine for relative paths. unveil_check_component checks if the directory is permitted. Otherwise you could chdir somewhere else and use relative paths to access stuff that you shouldn't.

} else {
    struct file *fp = fd_getfile(fdp, ndp->ni_dirfd);
    if (fp == NULL) {
        error = EBADF;
        goto fail;
    }
    dp = (struct vnode *)fp->f_data;
    if (fp->f_type != DTYPE_VNODE || dp->v_type != VDIR) {
        FRELE(fp, p);
        error = ENOTDIR;
        goto fail;
    }
    vref(dp);
    unveil_start_relative(p, ndp, dp);
    unveil_check_component(p, ndp, dp);
    FRELE(fp, p);
}

Case 3: So, if it's not an absolute path and not a relative path without an explicit file descriptor, then what is it? This seems to be for openat and friends, where the path is not relative to the current directory, but relative to a directory file descriptor.

We grab the file, check if it's a vnode and make sure it's actually pointing at a directory.

Then we do the exact same checks as in the second case.
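
The three cases can be boiled down to a tiny decision function. This is only a sketch of the branching, nothing more: the enum and the function name are made up, and the real code of course also takes references and runs the unveil checks.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L  /* for AT_FDCWD in <fcntl.h> */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>

/* Hypothetical names, used only for this sketch. */
enum sketch_start {
    SKETCH_START_ROOT,   /* case 1: absolute path */
    SKETCH_START_CWD,    /* case 2: relative to the current directory */
    SKETCH_START_DIRFD   /* case 3: relative to a directory descriptor */
};

/*
 * Decide where a lookup would start, using the same order of
 * checks as the snippets above: leading slash first, then AT_FDCWD,
 * then an explicit directory file descriptor.
 */
enum sketch_start
sketch_start_point(const char *path, int dirfd)
{
    assert(path != NULL);
    if (path[0] == '/')
        return SKETCH_START_ROOT;
    if (dirfd == AT_FDCWD)
        return SKETCH_START_CWD;
    return SKETCH_START_DIRFD;
}
```

Note the order: an absolute path wins even if a directory file descriptor was passed in, which matches how openat treats absolute paths.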

Almost there, homestretch. Take a sip of water, hang in. Into the loop we go to check the entire path.

for (;;) {
    if (!dp->v_mount) {
        /* Give up if the directory is no longer mounted */
        vrele(dp);
        error = ENOENT;
        goto fail;
    }

    cnp->cn_nameptr = cnp->cn_pnbuf;
    ndp->ni_startdir = dp;
    if ((error = vfs_lookup(ndp)) != 0)
        goto fail;

Just checking if the directory is actually still mounted and hasn't vanished into thin air while we were resolving. Then we hand everything to vfs_lookup to walk the path for us. We'll take a look at that someday later.

/*
 * If not a symbolic link, return search result.
 */
if ((cnp->cn_flags & ISSYMLINK) == 0) {
    if ((error = unveil_check_final(p, ndp))) {
        if ((cnp->cn_flags & LOCKPARENT) &&
            (cnp->cn_flags & ISLASTCN) &&
            (ndp->ni_vp != ndp->ni_dvp))
            vput(ndp->ni_dvp);
        if (ndp->ni_vp) {
            if ((cnp->cn_flags & LOCKLEAF))
                vput(ndp->ni_vp);
            else
                vrele(ndp->ni_vp);
        }
        goto fail;
    }
    if ((cnp->cn_flags & (SAVENAME | SAVESTART)) == 0)
        pool_put(&namei_pool, cnp->cn_pnbuf);
    else
        cnp->cn_flags |= HASBUF;
    return (0);
}

Path is walked, and it's not a symlink. That means we are done and can wrap it up. In the middle of that whole soup we can spot it if we squint: unveil_check_final. We have walked the path, it's not a symlink, so it's time for unveil_check_final to confirm whether our vnode is actually allowed to be accessed. If it isn't, we release whatever vnodes we are still holding and bail out.

If everybody's happy, we are done.

But what if it actually was a symlink? We will need to find out where it points and then do the whole thing all over again.

if ((cnp->cn_flags & LOCKPARENT) && (cnp->cn_flags & ISLASTCN))
    VOP_UNLOCK(ndp->ni_dvp);
if (ndp->ni_loopcnt++ >= SYMLOOP_MAX) {
    error = ELOOP;
    break;
}

First we will try to avoid endless loops. Imagine a symlink loop of a -> b -> a -> b -> .... So we keep count of how far we have looped. If we exceed SYMLOOP_MAX, we went too far and abort.
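
Here is a toy version of that counter, completely detached from the kernel. The "filesystem" is just a table of name to target pairs. All names are made up for this sketch, and the limit of 32 is my assumption for what SYMLOOP_MAX is on OpenBSD.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SYMLOOP_MAX 32  /* assumed stand-in for SYMLOOP_MAX */

/* A toy symlink: looking up "name" leads to "target". */
struct sketch_link {
    const char *name;
    const char *target;
};

/*
 * Follow links starting at name until we hit something that is
 * not a link. Returns 0 and stores the final name in *resultp,
 * or ELOOP if we had to follow too many links.
 */
int
sketch_follow(const struct sketch_link *links, size_t nlinks,
    const char *name, const char **resultp)
{
    const char *target;
    int loopcnt = 0;
    size_t i;

    assert(links != NULL && name != NULL && resultp != NULL);
    for (;;) {
        target = NULL;
        for (i = 0; i < nlinks; i++) {
            if (strcmp(links[i].name, name) == 0) {
                target = links[i].target;
                break;
            }
        }
        if (target == NULL) {   /* not a symlink: done */
            *resultp = name;
            return 0;
        }
        if (loopcnt++ >= SKETCH_SYMLOOP_MAX)
            return ELOOP;       /* same check as in namei */
        name = target;
    }
}
```

A table with a -> b and b -> a ends in ELOOP, while x -> y -> z happily resolves to z.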

        if (ndp->ni_pathlen > 1)
            cp = pool_get(&namei_pool, PR_WAITOK);
        else
            cp = cnp->cn_pnbuf;
        aiov.iov_base = cp;
        aiov.iov_len = MAXPATHLEN;
        auio.uio_iov = &aiov;
        auio.uio_iovcnt = 1;
        auio.uio_offset = 0;
        auio.uio_rw = UIO_READ;
        auio.uio_segflg = UIO_SYSSPACE;
        auio.uio_procp = cnp->cn_proc;
        auio.uio_resid = MAXPATHLEN;
        error = VOP_READLINK(ndp->ni_vp, &auio, cnp->cn_cred);
        if (error) {
badlink:
            if (ndp->ni_pathlen > 1)
                pool_put(&namei_pool, cp);
            break;
        }

Then we attempt to read the link with VOP_READLINK. Let's skip over all the auio struct stuff. I'm too tired to read up on those parameters and they're not relevant for unveil.

If there's still path left after the symlink (ni_pathlen > 1), we grab a fresh buffer from the pool. Otherwise we just reuse the existing one.

linklen = MAXPATHLEN - auio.uio_resid;
if (linklen == 0) {
    error = ENOENT;
    goto badlink;
}
if (linklen + ndp->ni_pathlen >= MAXPATHLEN) {
    error = ENAMETOOLONG;
    goto badlink;
}

Just some error checking. An empty symlink points nowhere, so that's ENOENT. And if the link target plus the remaining path would exceed MAXPATHLEN, we bail out with ENAMETOOLONG.

if (ndp->ni_pathlen > 1) {
    memcpy(cp + linklen, ndp->ni_next, ndp->ni_pathlen);
    pool_put(&namei_pool, cnp->cn_pnbuf);
    cnp->cn_pnbuf = cp;
} else
    cnp->cn_pnbuf[linklen] = '\0';
ndp->ni_pathlen += linklen;
vput(ndp->ni_vp);
dp = ndp->ni_dvp;

Here we just build the real path by stitching together the resolved symlink with the remaining path.

If the symlink was the last component, we null-terminate it, and then we are ready to run the loop again to hopefully finally resolve everything.
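
The length checks and the stitching can be tried out in userland too. The following is a sketch under my own assumptions: sketch_splice_link and SKETCH_MAXPATHLEN are made-up names, target stands for what VOP_READLINK returned, and rest stands for what ni_next points at (either empty or starting with a slash).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAXPATHLEN 1024  /* hypothetical stand-in for MAXPATHLEN */

/*
 * Hypothetical helper: build "target + rest" in out, which must
 * hold SKETCH_MAXPATHLEN bytes. Returns 0 on success, ENOENT for
 * an empty link target, ENAMETOOLONG if the result would not fit.
 */
int
sketch_splice_link(char *out, const char *target, const char *rest)
{
    size_t linklen, restlen;

    assert(out != NULL && target != NULL && rest != NULL);
    linklen = strlen(target);
    restlen = strlen(rest) + 1;     /* includes NUL, like ni_pathlen */

    if (linklen == 0)
        return ENOENT;
    if (linklen + restlen >= SKETCH_MAXPATHLEN)
        return ENAMETOOLONG;

    memcpy(out, target, linklen);
    memcpy(out + linklen, rest, restlen);   /* copies the NUL too */
    return 0;
}
```

So a link target of "private/etc" with "/pf.conf" still left to walk turns into "private/etc/pf.conf", which is what the next round of the loop gets to chew on.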

    /*
     * Check if root directory should replace current directory.
     */
    if (cnp->cn_pnbuf[0] == '/') {
        vrele(dp);
        dp = ndp->ni_rootdir;
        vref(dp);
        ndp->ni_unveil_match = NULL;
        unveil_check_component(p, ndp, dp);
        if (cnp->cn_flags & REALPATH) {
            cnp->cn_rpbuf[0] = '/';
            cnp->cn_rpbuf[1] = '\0';
            cnp->cn_rpi = 1;
        }
    } else if (cnp->cn_flags & REALPATH) {
        component_pop(cnp);
    }
}

If the link target is an absolute path, we have to start over from the root directory. We may have started relative, but after resolving the link we are absolute. So the previous unveil match is reset and unveil_check_component immediately runs again on the root.

Then we loop around and let vfs_lookup take another shot at it.

Conclusion

That was a long ass read, but it was worth it. None of this is specific to any one filesystem. It applies to all of them.

So we barely scratched the surface. OpenBSD still uses an old filesystem, so we might at some point actually take a look at it. ZFS and the like would be a liiiiittle bit too much for this blog I think.

I like that unveil is not just a layer slapped on top of things, but that it actually integrates into the core of it all. It's not an afterthought, it's right there. If the filesystem changes, unveil shall prevail and still be there.

Next up, as the poll has voted, we shall dive reaaaaally deep into the kernel. We will be looking at the actual boot process, which I might break up into several parts, as there are several stages. And this time we might also go back to compare between OpenBSD and FreeBSD.

So, stay tuned, and as always: please let me know about any feedback or suggestions for new topics.

When in doubt, I might throw up some polls to decide what the next topic is. I do tend to suffer from bias and do enjoy to be asked or pushed to read code-parts that I wouldn't dare to usually read.

Addendum

So far, the feedback and replies I have gotten for this series have been amazing and supportive. So thanks to everyone who has favorited, boosted, replied and participated in any way. It's what gives me a motivation boost to keep this going. It's mostly a self-learning experience, but the continued love is definitely helping to keep me going.

Over time, some code parts might become too big to tackle. So I am thinking about maybe turning this into a form of study group. Discord might be the easiest to set up, but that requires a private vendor lock-in. I am open to suggestions here. Matrix? A mailing list? Disqus?
