Daily Source Reading: namei
namei
Yesterday we looked at unveil. But unveil is only one half of the
picture. So far we only learned how it uses vnodes instead of paths
internally, how it handles coverage, and how it tracks the unveil
status.
But that's not the full story yet, is it? With pledge, we saw how it
manages things and how it enforces them. So we need to do the same
thing for unveil.
With pledge, that happened in pledge_syscall. With unveil, this
appears to happen in namei, the function that resolves pathnames to
vnodes. Any call that tries to open, create or execute a file has to
go through this function.
So without further ado, let's dive in. But strap in, because this is
going to be one long ride. There are two functions we want to take a
look at. And they won't just cover unveil, they will also cover some
of the most used filesystem functions.
On one hand, I would want to only concentrate on the unveil specific
parts, but on the other hand, while we are here, why not explore all
of its glory?
It might be a lot, so feel free to skip parts you don't wanna read. I
will sprinkle sub-headings in here, called unveil related. If you
are only interested in the unveil parts, search for those. But if I
already have to go and read and dissect the code, I will do it in one
shot. Peruse at your own speed and pleasure. I am sure we will be
revisiting parts at some point.
Enough of that, here we go:
What's in a name(i)
So, as a user, one might write:
    f = fopen("/etc/pf.conf", "r");
But have you ever wondered what happens underneath this? Sure,
"/etc/pf.conf" is a path. But that might also be a symlink, right?
So, with symlinks, and symlinks to symlinks, how do we actually access files? Like, proper files?
The answer is called a vnode. That is a unique "this file" identifier of sorts. We will get into the absolute details of this later on.
All we need to know for now is that OpenBSD has a function called
namei that converts from a pathname to a vnode.
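A vnode lives inside the kernel, so we can't poke at one from userspace. The closest thing we get to see is the device and inode number pair that stat(2) reports once the kernel has finished its lookup. Here is a minimal sketch of that idea. The helper name same_file is made up for this post, and treating (st_dev, st_ino) as a stand-in for "same vnode" is an analogy, not how the kernel compares things internally.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not a kernel API: returns 1 if both paths
 * resolve to the same file, 0 if they do not, and -1 if either
 * lookup fails. stat(2) follows symlinks, so each call makes the
 * kernel resolve the full path before reporting device and inode.
 */
int
same_file(const char *a, const char *b)
{
	struct stat sa, sb;

	assert(a != NULL && b != NULL);

	if (stat(a, &sa) == -1 || stat(b, &sb) == -1)
		return -1;
	return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

If /etc/pf.conf were a symlink, same_file("/etc/pf.conf", target) would report 1 for its target: two different names, one file underneath.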
And as a quick exercise, for new and old readers…how would you wanna
learn about namei? These blog posts are not 100% free. So before
we dive into the code, I at least expect you to man namei if you are
on an OpenBSD system or to read the man page.
Done? Good. So here we go:
    /*
     * Convert a pathname into a pointer to a vnode.
     *
     * The FOLLOW flag is set when symbolic links are to be followed
     * when they occur at the end of the name translation process.
     * Symbolic links are always followed for all other pathname
     * components other than the last.
     *
     * If the LOCKLEAF flag is set, a locked vnode is returned.
     *
     * The segflg defines whether the name is to be copied from user
     * space or kernel space.
     *
     * Overall outline of namei:
     *
     *     copy in name
     *     get starting directory
     *     while (!done && !error) {
     *         call lookup to search path.
     *         if symbolic link, massage name in buffer and continue
     *     }
     */
    int
    namei(struct nameidata *ndp)
    {
        struct filedesc *fdp;   /* pointer to file descriptor state */
        char *cp;               /* pointer into pathname argument */
        struct vnode *dp;       /* the directory we are searching */
        struct iovec aiov;      /* uio for reading symbolic links */
        struct uio auio;
        int error, linklen;
        struct componentname *cnp = &ndp->ni_cnd;
        struct proc *p = cnp->cn_proc;

        ndp->ni_cnd.cn_cred = ndp->ni_cnd.cn_proc->p_ucred;
    #ifdef DIAGNOSTIC
        if (!cnp->cn_cred || !cnp->cn_proc)
            panic ("namei: bad cred/proc");
        if (cnp->cn_nameiop & (~OPMASK))
            panic ("namei: nameiop contaminated with flags");
        if (cnp->cn_flags & OPMASK)
            panic ("namei: flags contaminated with nameiops");
    #endif
        fdp = cnp->cn_proc->p_fd;
Our usual start. Declare needed variables, do some diagnostic checks, the mundane stuff.
        /*
         * Get a buffer for the name to be translated, and copy the
         * name into the buffer.
         */
        if ((cnp->cn_flags & HASBUF) == 0)
            cnp->cn_pnbuf = pool_get(&namei_pool, PR_WAITOK);
        if (ndp->ni_segflg == UIO_SYSSPACE) {
            ndp->ni_pathlen = strlcpy(cnp->cn_pnbuf, ndp->ni_dirp,
                MAXPATHLEN);
            if (ndp->ni_pathlen >= MAXPATHLEN) {
                error = ENAMETOOLONG;
            } else {
                error = 0;
                ndp->ni_pathlen++;  /* ni_pathlen includes NUL */
            }
        } else
            error = copyinstr(ndp->ni_dirp, cnp->cn_pnbuf,
                MAXPATHLEN, &ndp->ni_pathlen);
There's that pool again (might be good to take a look at that
implementation down the road; let me know if it's something you wanna
see).
If the name is already coming from kernel space (UIO_SYSSPACE), then
we just strlcpy it. Otherwise we use the copyinstr we saw
yesterday to copy from userspace to kernel space.
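To make the length bookkeeping a bit more tangible, here is a small userspace sketch of the kernel-space branch. The function copy_path is hypothetical, and I use strlen plus memcpy instead of strlcpy because not every libc ships strlcpy. One deliberate difference: the kernel copies a truncated name before it notices the overflow, while this sketch leaves the destination untouched on failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the UIO_SYSSPACE branch. Copies src into
 * dst (capacity dstsize) and stores the length including the
 * terminating NUL in *pathlen, the same convention ni_pathlen uses.
 * Returns 0 on success or ENAMETOOLONG if src does not fit.
 */
int
copy_path(char *dst, size_t dstsize, const char *src, size_t *pathlen)
{
	size_t len;

	assert(dst != NULL && src != NULL && pathlen != NULL);
	assert(dstsize > 0);

	len = strlen(src);
	if (len >= dstsize)
		return ENAMETOOLONG;
	memcpy(dst, src, len + 1);	/* copy the NUL as well */
	*pathlen = len + 1;
	return 0;
}
```

Note that an empty string succeeds here with a length of 1, because only the NUL gets counted.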
        /*
         * Fail on null pathnames
         */
        if (error == 0 && ndp->ni_pathlen == 1)
            error = ENOENT;

        if (error)
            goto fail;

    #ifdef KTRACE
        if (KTRPOINT(cnp->cn_proc, KTR_NAMEI))
            ktrnamei(cnp->cn_proc, cnp->cn_pnbuf);
    #endif
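Self-explanatory with the comment: ni_pathlen includes the NUL, so a length of 1 means the path is empty. The KTRACE part just records the name for ktrace. Moving on.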
        /*
         * Strip trailing slashes, as requested
         */
        if (cnp->cn_flags & STRIPSLASHES) {
            char *end = cnp->cn_pnbuf + ndp->ni_pathlen - 2;

            cp = end;
            while (cp >= cnp->cn_pnbuf && (*cp == '/'))
                cp--;

            /* Still some remaining characters in the buffer */
            if (cp >= cnp->cn_pnbuf) {
                ndp->ni_pathlen -= (end - cp);
                *(cp + 1) = '\0';
            }
        }
Same here. Nothing magic, just proper sanitization, which is also important.
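If you want to play with the slash stripping outside the kernel, something like the following should behave the same way. It is a sketch with a made-up function name, and it uses indexes instead of walking a pointer below the start of the buffer, since that is not something I want to rely on in userspace C. The length again counts the NUL.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace sketch of the STRIPSLASHES step (hypothetical helper).
 * *pathlen counts the terminating NUL, like ni_pathlen. A path that
 * consists only of slashes is left untouched, as in the kernel code.
 */
void
strip_trailing_slashes(char *buf, size_t *pathlen)
{
	size_t end, i;

	assert(buf != NULL && pathlen != NULL);
	assert(*pathlen >= 2);		/* empty paths were rejected earlier */

	end = *pathlen - 2;		/* index of the last character */
	i = end + 1;
	while (i > 0 && buf[i - 1] == '/')
		i--;
	/* i is now the length of the path without trailing slashes */
	if (i > 0) {
		*pathlen -= (end + 1 - i);
		buf[i] = '\0';
	}
}
```

The "only slashes" case matters: without it, "/" would be stripped down to an empty path.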
        /*
         * Get starting point for the translation.
         */
        if ((ndp->ni_rootdir = fdp->fd_rdir) == NULL ||
            (ndp->ni_cnd.cn_flags & KERNELPATH))
            ndp->ni_rootdir = rootvnode;
Before we can actually resolve anything, we need to know where we
are starting from. The root directory for the process is grabbed from
the file descriptor table. If the process doesn't have one (or
KERNELPATH is set), we fall back to the system's rootvnode.
unveil related
And if we really pay attention, we will see our first unveil usage
in this code.
        if (ndp->ni_cnd.cn_flags & KERNELPATH) {
            ndp->ni_cnd.cn_flags |= BYPASSUNVEIL;
        } else {
            error = pledge_namei(p, ndp, cnp->cn_pnbuf);
            if (error)
                goto fail;
        }
See it? There it is. If the path is from kernel space, then we can
bypass unveil. For that, we set the flag here for later.
Why is that? The kernel is already trusted 100%. Wouldn't make much sense to restrict it.
Otherwise, we find ourselves in userspace. In that case, we call
pledge_namei. Yes, that is where pledge and unveil work
together. Depending on its promises, a process may not be allowed to
access a given path at all, so that check happens here.
Aight, so next up we will stop talking the talk and start walking the walk.
        /*
         * Check if starting from root directory or current directory.
         */
        if (cnp->cn_pnbuf[0] == '/') {
            dp = ndp->ni_rootdir;
            vref(dp);
            if (cnp->cn_flags & REALPATH && cnp->cn_rpi == 0) {
                cnp->cn_rpbuf[0] = '/';
                cnp->cn_rpbuf[1] = '\0';
                cnp->cn_rpi = 1;
            }
Case 1: the path starts with a /, so it's an absolute path. We
also grab a reference to the node, so it doesn't vanish.
The realpath stuff is probably for realpath(3) support, but I haven't
entirely read into that, so we skip over it. It's not doing anything
unveil related anyway.
        } else if (ndp->ni_dirfd == AT_FDCWD) {
            dp = fdp->fd_cdir;
            vref(dp);
            unveil_start_relative(p, ndp, dp);
            unveil_check_component(p, ndp, dp);
Case 2: Otherwise it's a relative path and we check if there's no explicit file descriptor (we'll see soon why).
unveil related
But wait, there are two unveil calls! unveil_start_relative sets up
the state machine for relative paths. unveil_check_component checks
if the directory is permitted. Otherwise you could chdir somewhere
else and use relative paths to access stuff that you shouldn't.
        } else {
            struct file *fp = fd_getfile(fdp, ndp->ni_dirfd);
            if (fp == NULL) {
                error = EBADF;
                goto fail;
            }
            dp = (struct vnode *)fp->f_data;
            if (fp->f_type != DTYPE_VNODE || dp->v_type != VDIR) {
                FRELE(fp, p);
                error = ENOTDIR;
                goto fail;
            }
            vref(dp);
            unveil_start_relative(p, ndp, dp);
            unveil_check_component(p, ndp, dp);
            FRELE(fp, p);
        }
Case 3: So, if it's not an absolute path and not a relative path
without an explicit file descriptor, then what is it? This seems to
be for openat, where the path is not relative to the current
directory, but relative to a file descriptor.
We grab the file, check if it's a vnode and make sure it's actually pointing at a directory.
Then we do the exact same checks as in the second case.
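The three cases boil down to a tiny decision, which can be sketched like this. The enum and the function name are made up for illustration. The only real thing borrowed here is AT_FDCWD from fcntl.h, and the feature-test macro at the top is just there so stricter compilers expose it.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <fcntl.h>	/* AT_FDCWD */
#include <stddef.h>

/* Hypothetical labels for the three starting points. */
enum start_kind { START_ROOT, START_CWD, START_DIRFD };

/*
 * Mirrors the order of the checks in namei: the leading slash wins,
 * then AT_FDCWD, and only then is dirfd treated as a directory.
 */
enum start_kind
classify_start(const char *path, int dirfd)
{
	assert(path != NULL);

	if (path[0] == '/')
		return START_ROOT;	/* absolute: dirfd is not consulted */
	if (dirfd == AT_FDCWD)
		return START_CWD;	/* relative to the working directory */
	return START_DIRFD;		/* relative to an open directory */
}
```

The order is the interesting bit: with an absolute path, the dirfd is never even looked at, so a bogus descriptor goes unnoticed in that case.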
Almost there, homestretch. Take a sip of water, hang in there. Into the loop we go to check the entire path.
        for (;;) {
            if (!dp->v_mount) {
                /* Give up if the directory is no longer mounted */
                vrele(dp);
                error = ENOENT;
                goto fail;
            }
            cnp->cn_nameptr = cnp->cn_pnbuf;
            ndp->ni_startdir = dp;
            if ((error = vfs_lookup(ndp)) != 0)
                goto fail;
Just checking if the directory is actually still mounted and hasn't
vanished into thin air while we were resolving. Then we hand
everything to vfs_lookup to walk the path for us. We'll take a look
at that someday later.
            /*
             * If not a symbolic link, return search result.
             */
            if ((cnp->cn_flags & ISSYMLINK) == 0) {
                if ((error = unveil_check_final(p, ndp))) {
                    if ((cnp->cn_flags & LOCKPARENT) &&
                        (cnp->cn_flags & ISLASTCN) &&
                        (ndp->ni_vp != ndp->ni_dvp))
                        vput(ndp->ni_dvp);
                    if (ndp->ni_vp) {
                        if ((cnp->cn_flags & LOCKLEAF))
                            vput(ndp->ni_vp);
                        else
                            vrele(ndp->ni_vp);
                    }
                    goto fail;
                }
                if ((cnp->cn_flags & (SAVENAME | SAVESTART)) == 0)
                    pool_put(&namei_pool, cnp->cn_pnbuf);
                else
                    cnp->cn_flags |= HASBUF;
                return (0);
            }
Path is walked, and it's not a symlink. That means we are done and
can wrap it up. In the middle of that whole soup we can spot it if we
squint: unveil_check_final. It runs the final confirmation that our
vnode is actually allowed to be accessed. If it says no, we release
whatever vnodes the lookup handed us and bail out.
If everybody's happy, we are done.
But what if it actually was a symlink? We will need to find out where it points and then do the whole thing all over again.
            if ((cnp->cn_flags & LOCKPARENT) && (cnp->cn_flags & ISLASTCN))
                VOP_UNLOCK(ndp->ni_dvp);
            if (ndp->ni_loopcnt++ >= SYMLOOP_MAX) {
                error = ELOOP;
                break;
            }
First we will try to avoid endless loops. Imagine a symlink loop of
a -> b -> a -> b -> .... So we keep count of how far we have
looped. If we exceed SYMLOOP_MAX, we went too far and abort.
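The counter trick is easy to try out on a toy model. Everything below is invented for illustration: the "filesystem" is a small table of names, and TOY_SYMLOOP_MAX is an arbitrary number I picked, not the real value of SYMLOOP_MAX. Only the shape of the loop (count, compare, give up with ELOOP) follows namei.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_SYMLOOP_MAX 32	/* assumption: arbitrary toy limit */

/* A toy "filesystem": each entry maps a name to a link target. */
struct toy_link {
	const char *name;
	const char *target;	/* NULL means a regular file */
};

static const struct toy_link *
toy_find(const struct toy_link *tab, size_t n, const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(tab[i].name, name) == 0)
			return &tab[i];
	return NULL;
}

/*
 * Follow links starting at name. On success stores the final name in
 * *out and returns 0. Returns ENOENT for unknown names and ELOOP once
 * more than TOY_SYMLOOP_MAX links have been followed.
 */
int
toy_resolve(const struct toy_link *tab, size_t n, const char *name,
    const char **out)
{
	int loopcnt = 0;

	assert(tab != NULL && name != NULL && out != NULL);

	for (;;) {
		const struct toy_link *e = toy_find(tab, n, name);

		if (e == NULL)
			return ENOENT;
		if (e->target == NULL) {	/* not a symlink: done */
			*out = e->name;
			return 0;
		}
		if (loopcnt++ >= TOY_SYMLOOP_MAX)
			return ELOOP;
		name = e->target;
	}
}
```

Note that the loop never tries to detect the cycle itself. It just counts hops, which also catches very long but finite chains.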
            if (ndp->ni_pathlen > 1)
                cp = pool_get(&namei_pool, PR_WAITOK);
            else
                cp = cnp->cn_pnbuf;
            aiov.iov_base = cp;
            aiov.iov_len = MAXPATHLEN;
            auio.uio_iov = &aiov;
            auio.uio_iovcnt = 1;
            auio.uio_offset = 0;
            auio.uio_rw = UIO_READ;
            auio.uio_segflg = UIO_SYSSPACE;
            auio.uio_procp = cnp->cn_proc;
            auio.uio_resid = MAXPATHLEN;
            error = VOP_READLINK(ndp->ni_vp, &auio, cnp->cn_cred);
            if (error) {
    badlink:
                if (ndp->ni_pathlen > 1)
                    pool_put(&namei_pool, cp);
                break;
            }
Then we attempt to read the link with VOP_READLINK. If there's still
path left after the symlink (ni_pathlen > 1), we grab a fresh buffer
for the link target first. Otherwise we reuse the existing one.
Let's skip over all the auio struct stuff. Too tired to read up on
those parameters and not relevant for unveil.
            linklen = MAXPATHLEN - auio.uio_resid;
            if (linklen == 0) {
                error = ENOENT;
                goto badlink;
            }
            if (linklen + ndp->ni_pathlen >= MAXPATHLEN) {
                error = ENAMETOOLONG;
                goto badlink;
            }
Just some error checking: we make sure the symlink actually points to something and isn't empty, and that the combined path won't exceed MAXPATHLEN once the symlink is resolved.
            if (ndp->ni_pathlen > 1) {
                memcpy(cp + linklen, ndp->ni_next, ndp->ni_pathlen);
                pool_put(&namei_pool, cnp->cn_pnbuf);
                cnp->cn_pnbuf = cp;
            } else
                cnp->cn_pnbuf[linklen] = '\0';
            ndp->ni_pathlen += linklen;
            vput(ndp->ni_vp);
            dp = ndp->ni_dvp;
Here we just build the real path by stitching together the resolved symlink with the remaining path.
If the symlink was the last component, we null-terminate it, and
then we are ready to run the loop again to hopefully finally resolve
everything.
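The stitching, together with the two error checks before it, fits into one small function. This is a sketch, not the kernel's code: splice_link is a made-up name, and it folds both branches into a single memcpy by relying on the remainder length counting the NUL (so an empty remainder has length 1).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper. Builds "link target + remaining path" in out.
 * restlen counts the terminating NUL of rest, like ni_pathlen, so an
 * empty remainder has restlen == 1. On success returns 0 and stores
 * the new length (again including the NUL) in *newlen. Otherwise
 * returns ENOENT for an empty link or ENAMETOOLONG if it won't fit.
 */
int
splice_link(char *out, size_t outsize, const char *link, size_t linklen,
    const char *rest, size_t restlen, size_t *newlen)
{
	assert(out != NULL && link != NULL && rest != NULL);
	assert(newlen != NULL);
	assert(restlen >= 1 && rest[restlen - 1] == '\0');

	if (linklen == 0)
		return ENOENT;		/* symlink points at nothing */
	if (linklen + restlen >= outsize)
		return ENAMETOOLONG;	/* combined path would not fit */

	memcpy(out, link, linklen);		/* link target, no NUL */
	memcpy(out + linklen, rest, restlen);	/* remainder plus NUL */
	*newlen = linklen + restlen;
	return 0;
}
```

So a link target of "/private/etc" with "/pf.conf" still left to walk turns into "/private/etc/pf.conf", and the next round of the loop starts over with that string.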
            /*
             * Check if root directory should replace current directory.
             */
            if (cnp->cn_pnbuf[0] == '/') {
                vrele(dp);
                dp = ndp->ni_rootdir;
                vref(dp);
                ndp->ni_unveil_match = NULL;
                unveil_check_component(p, ndp, dp);
                if (cnp->cn_flags & REALPATH) {
                    cnp->cn_rpbuf[0] = '/';
                    cnp->cn_rpbuf[1] = '\0';
                    cnp->cn_rpi = 1;
                }
            } else if (cnp->cn_flags & REALPATH) {
                component_pop(cnp);
            }
        }
If the link target turns out to be an absolute path, we have to start
over from the root directory. We may have started relative, but after
resolving the link it's absolute. So we drop the old directory, reset
ni_unveil_match and immediately run unveil_check_component again, this
time on the root.
Then we loop around and let vfs_lookup take another shot at it.
Conclusion
That was a long ass read, but it was worth it. None of this is specific to any one filesystem; it applies to all of them.
So we barely scratched the surface. OpenBSD still uses an old
filesystem, so we might at some point actually take a look at it.
ZFS and the like would be a liiiiittle bit too much for this blog I
think.
I like that unveil is not just a layer slapped on top of things, but
that it actually integrates into the core of it all. It's not an
afterthought, it's right there. If the filesystem changes, unveil
shall prevail and still be there.
Next up, as the poll has voted, we shall dive reaaaaally deep into the
kernel. We will be looking at the actual boot process, which I might
break up into several parts, as there are several stages. And this
time we might also go back to compare between OpenBSD and FreeBSD.
So, stay tuned, and as always: please let me know about any feedback or suggestions for new topics.
When in doubt, I might throw up some polls to decide what the next topic is. I do tend to suffer from bias and enjoy being asked or pushed to read code parts that I usually wouldn't dare to read.
Addendum
So far, the feedback and replies I have gotten for this series have been amazing and supportive. So thanks to everyone who has favorited, boosted, replied and participated in any way. It's what gives me a motivation boost to keep this going. It's mostly a self-learning experience, but the continued love is definitely helping to keep me going.
Over time, some code parts might become too big to tackle. So I am thinking about maybe turning this into a form of study group. Discord might be the easiest to set up, but that requires a private vendor lock-in. I am open to suggestions here. Matrix? A mailing list? Disqus?