On 07/24/15 17:28, NeilBrown wrote: > Hi Neil, Some edits for you to consider. > > diff --git a/Documentation/filesystems/path-lookup.md b/Documentation/filesystems/path-lookup.md > new file mode 100644 > index 000000000000..eedf31c737ed > --- /dev/null > +++ b/Documentation/filesystems/path-lookup.md > @@ -0,0 +1,1297 @@ > +<head> > +<style> p { max-width:50em} ol, ul {max-width: 40em}</style> > +</head> > + > +Pathname lookup in Linux. > +========================= > + > +This write-up is based on three articles published at lwn.net: > + > +- <https://lwn.net/Articles/649115/> Pathname lookup in Linux > +- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux > +- <https://lwn.net/Articles/650786/> A walk among the symlinks > + > +Written by Neil Brown with help from Al Viro and Jon Corbet. > + > +Introduction > +------------ > +Bringing it together with `struct nameidata` > +-------------------------------------------- > + > +### `struct path root` ### > + > +This is used to hold a reference to the effective root of the > +filesystem. Often that reference won't be needed, so this field is > +only assigned the first time it is used, or when a non-standard root > +is requested. Keeping a reference in the `nameidata` ensures that > +only one root is in effect for the entire path walk, even if it races > +with a `chroot()` system call. > + > +The root is needed when either of two conditions holds: (1) either the > +pathname or a symbolic link starts with a "'/'", or (2) a "`..`" > +component is being handled, since "`..`" from the root must always stay > +at the root. The value used is usually the current root directory of > +the calling process. An alternate root can be provided as when > +`sysctl()` calls `file_open_root()`, and when NFSv4 or Btrfs call > +`mount_subtree()`. In each case a pathname is being looked up in a very > +specific part of the filesystem, and the lookup must not be allowed to > +escape that subtree. It works a bit like a local `chroot()`. > + > +Ignoring the handling of symbolic links, we can now describe the > +"`link_path_walk()`" function, which handles the lookup of everything > +except the final component as: > + > +> Given a path (`name`) and a nameidata structure (`nd`), check that the > +> current directory has execute permission and then advance `name` > +> over one component while updating `last_type` and `last`. If that > +> was the final component, then return, otherwise call > +> `walk_component()` and repeat from the top. > + > +`walk_component()` is even easier. If the component is `LAST_DOTS`, > +it calls `handle_dots()` which does the necessary locking as already > +described. If it finds a `LAST_NORM` component it first calls > +"`lookup_fast()`" which only looks in the dcache, but will ask the > +filesystem to revalidate the result if it is that sort of filesystem. > +If that doesn't get a good result, it calls "`lookup_slow()`" which > +takes the `i_mutex`, rechecks the cache, and then asks the filesystem > +to find a definitive answer. Each of these will call > +`follow_managed()` (as described below) to handle any mount points. > + > +In the absence of symbolic links, `walk_component()` creates a new > +`struct path` containing a counted reference to the new dentry and a > +reference to the new `vfsmount` which is only counted if it is > +different from the previous `vfsmount`. It then calls > +`path_to_nameidata()` to install the new `struct path` in the > +`struct nameidate` and drop the unneeded references. nameidata > + > +This "hand-over-hand" sequencing of getting a reference to the new > +dentry before dropping the reference to the previous dentry may > +seem obvious, but is worth pointing out so that we will recognize its > +analogue in the "RCU-walk" version. > + > +Handling the final component. > +----------------------------- > + > +`link_path_walk()` only walks as far as setting `nd->last` and > +`nd->last_type` to refer to the final component of the path. It does > +not call `walk_component()` that last time. Handling that final > +component remains for the caller to sort out. Those callers are > +`path_lookupat()`, `path_parentat()`, `path_mountpoint()` and > +`path_openat()` each of which handles the differing requirements of > +different system calls. > + > +`path_parentat()` is clearly the simplest - it just wraps a little bit > +of housekeeping around `link_path_walk()` and returns the parent > +directory and final component to the caller. The caller will be either > +aiming to create a name (via `filename_create()`) or remove or rename > +a name (in which case `user_path_parent()` is used). They will use > +`i_mutex` to exclude other changes while they validate and then > +perform their operation. > + > +`path_lookupat()` is nearly as simple - it is used when an existing > +object is wanted such as by `stat()` or `chmod()`. It essentially just > +calls `walk_component()` on the final component through a call to > +`lookup_last()`. `path_lookupat()` returns just the final dentry. > + > +`path_mountpoint()` handles the special case of unmounting which must > +not try to revalidate the mounted filesystem. It effectively > +contains, through a call to `mountpoint_last()`, an alternate > +implementation of `lookup_slow()` which skips that step. This is > +important when unmounting a filesystem that is inaccessible, such as > +one provided by a dead NFS server. > + > +Finally `path_openat()` is used for the `open()` system call; it > +contains, in support functions starting with "`do_last()`", all the > +complexity needed to handle the different subtleties of O_CREAT (with > +or without O_EXCL) final "`/`" characters, and trailing symbolic missing comma? , final > +links. We will revisit this in the final part of this series, which > +focuses on those symbolic links. "`do_last()`" will sometimes, but > +not always, take `i_mutex`, depending on what it finds. > + > +Each of these, or the functions which call them, need to be alert to > +the possibility that the final component is not `LAST_NORM`. If the > +goal of the lookup is to create something, then any value for > +`last_type` other than `LAST_NORM` will result in an error. For > +example if `path_parentat()` reports `LAST_DOTDOT`, then the caller > +won't try to create that name. They also check for trailing slashes > +by testing `last.name[last.len]`. If there is any character beyond > +the final component, it must be a trailing slash. > + > +Revalidation and automounts > +--------------------------- > + > +Apart from symbolic links, there are only two parts of the "REF-walk" > +process not yet covered. One is the handling of stale cache entries > +and the other is automounts. > + > +On filesystems that require it, the lookup routines will call the > +`->d_revalidate()` dentry method to ensure that the cached information > +is current. This will often confirm validity or update a few details > +from a server. In some cases it may find that there has been change > +further up the path and that something that was thought to be valid > +previously isn't really. When this happens the lookup of the whole > +path is aborted and retried with the "`LOOKUP_REVAL`" flag set. This > +forces revalidation to be more thorough. We will see more details of > +this retry process in the next article. > + > +Automount points are locations in the filesystem where an attempt to > +lookup a name can trigger changes to how that lookup should be > +handled, in particular by mounting a filesystem there. These are > +covered in greater detail in autofs4.txt in the Linux documentation > +tree, but a few notes specifically related to path lookup are in order > +here. > + > +The Linux VFS has a concept of "managed" dentries which is reflected > +in function names such as "`follow_managed()`". There are three > +potentially interesting things about these dentries corresponding > +to three different flags that might be set in `dentry->d_flags`: > + > +RCU-walk - faster pathname lookup in Linux > +========================================== > + > +RCU-walk is another algorithm for performing pathname lookup in Linux. > +It is in many ways similar to REF-walk and the two share quite a bit > +of code. The significant difference in RCU-walk is how it allows for > +the possibility of concurrent access. > + > +We noted that REF-walk is complex because there are numerous details > +and special cases. RCU-walk reduces this complexity by simply > +refusing to handle a number of cases -- it instead falls back to > +REF-walk. The difficulty with RCU-walk comes from a different > +direction: unfamiliarity. The locking rules when depending on RCU are > +quite different to traditional locking, so we will spend a little extra I would say: from > +time when we come to those. > + > +Clear demarcation of roles > +-------------------------- > + > +The easiest way to manage concurrency is to forcibly stop any other > +thread from changing the data structures that a given thread is > +looking at. In cases where no other thread would even think of > +changing the data and lots of different threads want to read at the > +same time, this can be very costly. Even when using locks that permit > +multiple concurrent readers, the simple act of updating the count of > +the number of current readers can impose an unwanted cost. So the > +goal when reading a shared data structure that no other process is > +changing, is to avoid writing anything to memory at all. Take no dubious comma^ > +locks, increment no counts, leave no footprints. > + > +RCU and seqlocks: fast and light > +-------------------------------- > + > +However, there is a little bit more to seqlocks than that. If > +RCU-walk accesses two different fields in a seqlock-protected > +structure, or accesses the same field twice, there is no a-priori no hyphen: ^ > +guarantee of any consistency between those accesses. When consistency > +is needed - which it usually is - RCU-walk must take a copy and then > +use `read_seqcount_retry()` to validate that copy. > + > +`unlazy walk()` and `complete_walk()` > +------------------------------------- > + > +A pair of patterns > +------------------ > + > +In various places in the details of REF-walk and RCU-walk, and also in > +the big picture, there are a couple of related patterns that are worth > +being aware of. > + > +The first is "try quickly and check, if that fails try slowly". We > +can see that in the high-level approach of first trying RCU-walk and > +then trying REF-walk, and in places were `unlazy_walk()` is used to where > +switch to REF-walk for the rest of the path. We also saw it earlier > +in `dget_parent()` when following a "`..`" link. It tries a quick way > +to get a reference, then falls back to taking locks if needed. > + > +A walk among the symlinks > +========================= > + > + > +Storage and lifetime of cached symlinks > +--------------------------------------- > + > + > +When neither of these are suitable, the next most likely scenario is is > +that the filesystem will allocate some temporary memory and copy or > +construct the symlink content into that memory whenever it is needed. > + > +Updating the access time > +------------------------ > + > +We previously said of RCU-walk that it would "take no locks, increment > +no counts, leave no footprints." We have since seen that some > +"footprints" can be needed when handling symlinks as a counted > +reference (or even a memory allocation) may be needed. But these > +footprints are best kept to a minimum. > + > +One other place where walking down a symlink can involve leaving > +footprints in a way that doesn't affect directories is in updating access times. > +In Unix (and Linux) every filesystem object has a "last accessed > +time", or "`atime`". Passing through a directory to access a file > +within, is not considered to be an access for the purposes of drop ^ comma? > +`atime`; only listing the contents of a directory can update its `atime`. > +Symlinks are different it seems. Both reading a symlink (with `readlink()`) > +and looking up a symlink on the way to some other destination can > +update the atime on that symlink. > + -- ~Randy -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html