Document design and implementation of union mounts (a.k.a. writable overlays).
---
 Documentation/filesystems/union-mounts.txt |  899 ++++++++++++++++++++++++++++
 1 files changed, 899 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/union-mounts.txt

diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..ba830e8
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,899 @@
+Union mounts (a.k.a. writable overlays)
+=======================================
+
+This document describes the architecture and current status of union
+mounts, also known as writable overlays.
+
+In this document:
+ - Overview of union mounts
+ - Terminology
+ - VFS implementation
+ - Locking strategy
+ - VFS/file system interface
+ - Userland interface
+ - NFS interaction
+ - Status
+ - Contributing to union mounts
+
+Overview
+========
+
+A union mount layers one read-write file system over one read-only
+file system, with all writes going to the writable file system. The
+namespace of both file systems appears as a combined whole to
+userland, with files and directories on the writable file system
+covering up any files or directories with matching pathnames on the
+read-only file system. The read-write file system is the "topmost"
+or "upper" file system and the read-only file system is the "lower"
+file system. A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block
+device or a clustered file system (including NFS mounts). However, for
+some use cases, sharing is more efficient and better performing if
+done at the file system namespace level.
+COW block devices only increase their divergence as time goes on, and
+a fully coherent writable file system is unnecessary synchronization
+overhead if no other client needs to see the writes.
+
+What union mounts are not
+-------------------------
+
+Union mounts are not a general-purpose unioning file system. They do
+not provide a generic "union of namespaces" operation for an arbitrary
+number of file systems. Many interesting features can be implemented
+with a generic unioning facility: unioning of more than two file
+systems, dynamic insertion and removal of branches, online upgrade,
+etc. Some unioning file systems that do this are UnionFS and AUFS.
+
+File systems can only be union mounted at their mountpoints, and the
+lower level file system cannot have any submounts.
+
+Terminology
+===========
+
+The main physical metaphor for union mounts is that a writable file
+system is mounted "on top" of a read-only file system. Lookups start
+at the "topmost" read-write file system and travel "down" to the
+"bottom" read-only file system only if no blocking entry exists on the
+top layer.
+
+Topmost layer: The read-write file system. Lookups begin here.
+
+Bottom layer: The read-only file system. Lookups end here.
+
+Path: Combination of the vfsmount and dentry structure.
+
+Follow down: Given a path from the top layer, find the corresponding
+path on the bottom layer.
+
+Follow up: Given a path from the bottom layer, find the corresponding
+path on the top layer.
+
+Whiteout: A directory entry in the top layer that prevents lookups
+from travelling down to the bottom layer. Created on unlink()/rmdir()
+if a corresponding directory entry exists in the bottom layer.
+
+Opaque flag: A flag on a directory in the top layer that prevents
+lookups of entries in this directory from travelling down to the
+bottom layer (unless there is an explicit fallthru entry allowing that
+for a particular entry).
+Set on creation of a directory that replaces a whiteout, and after a
+directory copyup.
+
+Fallthru: A directory entry which allows lookups to "fall through" to
+the bottom layer for that exact directory entry. This serves as a
+placeholder for directory entries from the bottom layer during
+readdir(). Fallthrus override opaque flags.
+
+File copyup: Create a file on the top layer that has the same metadata
+and contents as the file with the same pathname on the bottom layer.
+
+Directory copyup: Copy up the visible directory entries from the
+bottom layer as fallthrus in the matching top layer directory. Mark
+the directory opaque to avoid unnecessary negative lookups on the
+bottom layer.
+
+Examples
+========
+
+What happens when I...
+
+- creat() /newfile -> creates on topmost layer
+- unlink() /oldfile -> creates a whiteout on topmost layer
+- Edit /existingfile -> copies up to top layer at open(O_WR) time
+- truncate /existingfile -> copies up to topmost layer + N bytes if specified
+- touch()/chmod()/chown()/etc. -> copies up to topmost layer
+- mkdir() /newdir -> creates on topmost layer
+- rmdir() /olddir -> creates a whiteout on topmost layer
+- mkdir() /olddir after above -> creates on topmost layer w/ opaque flag
+- readdir() /shareddir -> copies up entries from bottom layer as fallthrus
+- link() /oldfile /newlink -> copies up /oldfile, creates /newlink on topmost layer
+- symlink() /oldfile /symlink -> nothing special
+- rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer
+- rename() /olddir /newdir -> EXDEV
+- rename() /topmost_only_dir /topmost_only_dir2 -> success
+
+Getting to a root file system with union mounts:
+
+- Mount the base read-only file system as the root file system
+- Mount the read-only file system again on /newroot
+- Mount the read-write layer on /newroot:
+   # mount -o union /dev/sda /newroot
+- pivot_root to /newroot
+- Start init
+
+See scripts/pivot.sh in the UML devkit linked to from:
+
+http://valerieaurora.org/union/
+
+VFS implementation
+==================
+
+Union mounts are implemented as an integral part of the VFS, rather
+than as a VFS client file system (i.e., a stacked file system like
+unionfs or ecryptfs). Implementing unioning inside the VFS eliminates
+the need for duplicate copies of VFS data structures, unnecessary
+indirection, and code duplication, but requires very maintainable,
+low-to-zero overhead code. Union mounts require no change to file
+systems serving as the read-only layer, and require some minor
+support from file systems serving as the read-write layer. File
+systems that want to be the writable layer must implement the new
+->whiteout() and ->fallthru() inode operations, which create special
+dummy directory entries.
+
+The union mounts code must accomplish the following major tasks:
+
+1) Pass lookups through to the lower level file system.
+2) Copy files and directories up to the topmost layer when written.
+3) Create whiteouts and fallthrus as necessary.
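To make task 1 and the whiteout, fallthru, and opaque rules from the Terminology section concrete, here is a toy userspace model in C of how a single name might resolve in a two-layer union. It is a sketch under stated assumptions, not kernel code: the enums and union_resolve() are hypothetical names invented for this example, and the model ignores copyup and directory handling entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what the top layer holds for one name. */
enum top_entry {
	TOP_NONE,	/* no directory entry on the top layer */
	TOP_REGULAR,	/* ordinary file or directory entry */
	TOP_WHITEOUT,	/* blocks the lookup from travelling down */
	TOP_FALLTHRU,	/* explicitly sends the lookup to the bottom layer */
};

/* Hypothetical outcome of a lookup in a two-layer union. */
enum lookup_result {
	FOUND_TOP,
	FOUND_BOTTOM,
	NOT_FOUND,
};

/*
 * Decide which layer satisfies a lookup of one name.
 *   top        - what the top layer directory holds for the name
 *   dir_opaque - whether the top layer parent directory is opaque
 *   in_bottom  - whether the bottom layer has a matching entry
 */
enum lookup_result union_resolve(enum top_entry top, bool dir_opaque,
				 bool in_bottom)
{
	assert(top >= TOP_NONE && top <= TOP_FALLTHRU);

	switch (top) {
	case TOP_REGULAR:
		/* The top layer covers up anything below it. */
		return FOUND_TOP;
	case TOP_WHITEOUT:
		/* A whiteout hides the bottom layer entry. */
		return NOT_FOUND;
	case TOP_FALLTHRU:
		/* Fallthrus override the opaque flag. */
		return in_bottom ? FOUND_BOTTOM : NOT_FOUND;
	case TOP_NONE:
		/* An opaque directory stops the lookup at the top layer. */
		if (dir_opaque)
			return NOT_FOUND;
		return in_bottom ? FOUND_BOTTOM : NOT_FOUND;
	}
	return NOT_FOUND;
}
```

For example, union_resolve(TOP_FALLTHRU, true, true) yields FOUND_BOTTOM because a fallthru overrides the opaque flag, while union_resolve(TOP_NONE, true, true) yields NOT_FOUND.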
+
+VFS objects and union mounts
+----------------------------
+
+First, some VFS basics:
+
+The VFS allows multiple mounts of the same file system. For example,
+/dev/sda can be mounted at /usr and also at /mnt. The same file
+system can be mounted read-only at one point and read-write at
+another. Each of these mounts has its own vfsmount data structure in
+the kernel. However, each underlying file system has exactly one
+in-kernel superblock structure no matter how many times it is mounted.
+All the separate vfsmounts for the same file system reference the same
+superblock data structure.
+
+Directory entries are cached by the VFS in dentry structures. The VFS
+keeps one dentry structure for each file or directory in a file
+system, no matter how many times it is mounted. Each dentry
+represents only one element of a path name. When the VFS looks up a
+pathname (e.g., "/sbin/init"), the result is a combination of vfsmount
+and dentry. This <mnt,dentry> pair is usually stored in a kernel
+structure named "path", which is simply two pointers, one to the
+vfsmount and one to the dentry. A "struct path" is this structure; a
+pathname is a string like "/etc/fstab".
+
+As an example, given:
+
+/dev/sda mounted on /mnt
+/dev/sda mounted on /mnt2
+
+A pathname lookup for "/mnt/etc" will yield the pair:
+
+<vfsmount for /mnt, dentry for "etc" on /dev/sda>
+
+A pathname lookup for "/mnt2/etc" will yield the pair:
+
+<vfsmount for /mnt2, dentry for "etc" on /dev/sda>
+
+The dentry in both cases will be the exact same structure in memory.
+
+A union mount maps <mnt,dentry> pairs from the file system mounted on
+the "top" to <mnt,dentry> pairs from the file system on the "bottom."
+The same dentry can be a member of more than one union mount.
+For example, given:
+
+/dev/sdb union mounted on top of /dev/sda on /mnt/union1
+/dev/sdc union mounted on top of /dev/sda on /mnt/union2
+
+The dentry for the directory "etc/" on /dev/sda will be part of two
+union mount mappings:
+
+<vfsmount for /dev/sdb on /mnt/union1, dentry for "etc" on /dev/sdb>
+                        |
+                        v
+<vfsmount for /dev/sda on /mnt/union1, dentry for "etc" on /dev/sda>
+
+And:
+
+<vfsmount for /dev/sdc on /mnt/union2, dentry for "etc" on /dev/sdc>
+                        |
+                        v
+<vfsmount for /dev/sda on /mnt/union2, dentry for "etc" on /dev/sda>
+
+All of this is to say that we require a full <mnt,dentry> pair to
+accomplish any union mount tasks like copying a file to the topmost
+layer or looking up a directory entry in a lower layer. A dentry
+alone is not sufficient, since it can be part of several different
+union mounts.
+
+union_dir structure
+-------------------
+
+The first job of union mounts is to map directories from the topmost
+layer to directories with the same pathname in the lower layer. That
+is, we need to map the <mnt,dentry> pair for a given directory
+pathname in the topmost layer to the <mnt,dentry> pair for the
+directory with the same pathname in the lower layer. We do this with
+the union_dir structure:
+
+struct union_dir {
+	atomic_t u_count;		/* reference count */
+	struct list_head u_unions;	/* list head for d_unions */
+	struct list_head u_list;	/* list head for mnt_unions */
+	struct hlist_node u_hash;	/* list head for searching */
+	struct hlist_node u_rhash;	/* list head for reverse searching */
+
+	struct path u_upper;		/* this is me */
+	struct path u_lower;		/* this is what I overlay */
+};
+
+This structure is flexible enough to support an arbitrary number of
+layers of unioned file systems, not just the current two-layer
+implementation. As such, this section will talk about mapping "upper"
+directories to "lower" directories, instead of "topmost" directories
+to "bottom" directories.
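The upper-to-lower mapping can be illustrated with a small userspace model in C, offered as a sketch under stated assumptions rather than the kernel implementation. The structures are reduced stand-ins that keep only the fields needed here, toy_union_down_one() and toy_union_demo() are hypothetical names, and a linear search replaces whatever lookup structure the kernel uses. The demo rebuilds the /mnt/union1 and /mnt/union2 example to show why a dentry alone cannot identify a union.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel types; only their addresses matter here. */
struct vfsmount { const char *name; };
struct dentry { const char *name; };

/* Same shape as the kernel's struct path: a <mnt,dentry> pair. */
struct path {
	struct vfsmount *mnt;
	struct dentry *dentry;
};

/* Reduced union_dir: only the upper -> lower mapping is modeled. */
struct union_dir {
	struct path u_upper;
	struct path u_lower;
};

/*
 * Hypothetical helper: return the lower path recorded for an upper
 * path, or NULL if the upper path is not unioned.  Both the mount and
 * the dentry must match.
 */
const struct path *toy_union_down_one(const struct union_dir *tbl, size_t n,
				      const struct path *upper)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].u_upper.mnt == upper->mnt &&
		    tbl[i].u_upper.dentry == upper->dentry)
			return &tbl[i].u_lower;
	}
	return NULL;
}

/*
 * Rebuild the two-union example: "etc" on /dev/sda is the lower half
 * of two different unions.  Returns 1 if both upper paths map to the
 * same lower dentry through different lower mounts.
 */
int toy_union_demo(void)
{
	struct vfsmount sdb_mnt = { "/dev/sdb on /mnt/union1" };
	struct vfsmount sdc_mnt = { "/dev/sdc on /mnt/union2" };
	struct vfsmount sda_mnt1 = { "/dev/sda on /mnt/union1" };
	struct vfsmount sda_mnt2 = { "/dev/sda on /mnt/union2" };
	struct dentry etc_sda = { "etc" };
	struct dentry etc_sdb = { "etc" };
	struct dentry etc_sdc = { "etc" };

	struct union_dir table[] = {
		{ { &sdb_mnt, &etc_sdb }, { &sda_mnt1, &etc_sda } },
		{ { &sdc_mnt, &etc_sdc }, { &sda_mnt2, &etc_sda } },
	};
	struct path up1 = { &sdb_mnt, &etc_sdb };
	struct path up2 = { &sdc_mnt, &etc_sdc };
	const struct path *low1 = toy_union_down_one(table, 2, &up1);
	const struct path *low2 = toy_union_down_one(table, 2, &up2);

	assert(low1 != NULL && low2 != NULL);
	/* Same dentry, different vfsmount: the dentry alone is ambiguous. */
	return low1->dentry == low2->dentry && low1->mnt != low2->mnt;
}
```

Calling toy_union_demo() returns 1 in this model: the shared "etc" dentry on /dev/sda is reached through two distinct mounts, so only the full <mnt,dentry> pair tells the two unions apart.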
+
+At the time of a union mount, we allocate a union_dir structure to map
+the root directory of the upper layer to the root directory of the
+lower layer. In pseudo-code:
+
+u_upper = <upper mnt,dentry for "/">
+u_lower = <lower mnt,dentry for "/">
+
+This union_dir structure is then added to the union cache hash table,
+linked through u_hash, where it can be looked up via union_lookup()
+with the <upper mnt,dentry> pair as the key. A reverse lookup is also
+included (union_rlookup() using the <lower mnt,dentry> pair, linked
+through u_rhash) but is not currently used.
+
+The union_dir is also added to the list of union_dir structures that
+reference this dentry as the topmost dentry. This list is linked
+through the u_unions member in struct union_dir and the new d_unions
+member in struct dentry. The new d_union_lower_count member in struct
+dentry is a reference count showing how many unions reference this
+dentry through u_lower - that is, how many mounts this dentry is a
+lower dentry for.
+
+struct dentry {
+[...]
+#ifdef CONFIG_UNION_MOUNT
+	/*
+	 * Union mount structures that reference this dentry as the
+	 * upper layer are linked through the d_unions field. If this
+	 * list is not empty, then this dentry is part of a unioned
+	 * directory stack. Protected by union_lock.
+	 */
+	struct list_head d_unions;
+	/*
+	 * Reference count of union_dirs with this dentry in the
+	 * u_lower field of a union mount structure - that is, it is a
+	 * dentry for a lower layer of a union. This count is NOT
+	 * incremented for the dentry that is part of the topmost
+	 * layer of a union.
+	 */
+	unsigned int d_union_lower_count;
+#endif
+[...]
+};
+
+Each union_dir is also linked through the new mnt_unions member in the
+vfsmount structure of the upper mount:
+
+struct vfsmount {
+[...]
+#ifdef CONFIG_UNION_MOUNT
+	struct list_head mnt_unions;	/* list of union_dir structures */
+#endif
+[...]
+};
+
+Traversing the union stack
+--------------------------
+
+The set of union_dir structures referring to a particular pathname are
+called collectively the union stack for that directory. (In the
+current code, only two layers and one union mount structure per path
+are allowed, but multiple layers are possible.) Note that in a union
+stack, none of the union_dir structures reference each other directly.
+Each union_dir struct records the relationship between two
+<mnt,dentry> pairs, the upper pair and the lower pair. If a third
+layer existed, you would traverse from the top layer to the second
+layer by calling union_lookup() on the top layer's <mnt,dentry> pair.
+This would return the union_dir struct with u_upper pointing to the
+top layer's <mnt,dentry>. Next you would take u_lower, which points
+to the second layer's <mnt,dentry> and call union_lookup() on that,
+which would return the union_dir mapping the second layer's
+<mnt,dentry> to the third layer's <mnt,dentry>.
+
+To traverse "down" the union stack one layer, use union_down_one().
+Currently, we never traverse the union stack "up" except as part of
+the normal VFS follow_mount() operation. follow_mount() is what lets
+us traverse from the directory serving as mountpoint to the root
+directory of the file system mounted at that mountpoint. Traversing
+the union stack "up" introduces lock ordering problems and generally
+complicates the code to the point of unmaintainability. Currently,
+union mounts performs all its tasks as it traverses the union stack
+exactly once, going "down" in the union mounts terminology.
+
+Code paths
+----------
+
+Union mounts modify the following key code paths in the VFS:
+
+- mount()/umount()
+- Pathname lookup
+- Any path that modifies an existing file
+
+Mount
+-----
+
+Union mounts are created in two steps:
+
+1. Mount the bottom layer file system read-only in the usual manner.
+2. Mount the top layer with the "-o union" option at the same mountpoint.
+
+The bottom layer must be read-only and the top layer must be
+read-write and support whiteouts and fallthrus. A file system that
+supports whiteouts and fallthrus indicates this by setting the
+MS_WHITEOUT flag in the superblock. Currently, the top layer is
+forced to "noatime" to avoid a copyup on every access of a file.
+Supporting atime with the current infrastructure would require a
+copyup on every open(). The "relatime" option would be equally
+efficient if the atime is the same or more recent than the mtime/ctime
+for every object on the read-only file system, and if the 24-hour
+timeout on relatime was disabled. However, this is probably not
+worthwhile for the majority of union mount use cases.
+
+The current step-by-step method of mounting union file systems won't
+work for three or more layers. Say you want to union mount three file
+systems on /mnt/union:
+
+/dev/bottom - read-only bottom layer
+/dev/middle - read-only middle layer
+/dev/topmost - read-write topmost layer
+
+First you mount the bottom layer read-only:
+
+mount -o ro /dev/bottom /mnt/union
+
+Then you want to mount the middle layer also read-only, but union
+mounts requires that the top layer be read-write in order to support
+readdir() correctly:
+
+mount -o ro,union /dev/middle /mnt/union # WON'T WORK, fails
+
+The other approach is to mount the middle layer as read-write, but
+then the third mount of the topmost layer will fail because the
+underlying layer is not read-only:
+
+mount -o union /dev/middle /mnt/union
+mount -o union /dev/topmost /mnt/union # WON'T WORK, fails
+
+Two obvious options present themselves:
+
+1) Automatically attempt to convert the covered layer to read-only
+status. In this case, the mount of /dev/topmost would attempt to
+atomically remount /dev/middle as read-only during sys_mount(). If it
+succeeds, it would go on to mount /dev/topmost as read-write and
+unioned.
+This would actually be a usability improvement, since the
+administrator need not remember to mount the lower layers read-only.
+
+2) Execute the mount of all three layers in one system call by passing
+a mount option that is a string describing all the devices to be
+unioned together. This is ugly for obvious reasons: string parsing in
+the kernel, poor error granularity, need to unwind complicated state
+if the mount fails partway through the stack.
+
+The lower layer file system must not have any submounts - other file
+systems mounted at points in the lower file system's namespace. File
+systems can only be union mounted at their root directories. Without
+this restriction, some VFS operations must always do a union_lookup()
+- requiring a global lock - in order to find out if a path is
+potentially unioned. With this restriction, we can tell if a path is
+potentially unioned by checking a flag in the vfsmount.
+
+pivot_root() to a union mounted file system is supported. The
+recommended way to get to a union mounted root file system is to boot
+with the read-only mount as the root file system, construct the union
+mount on an entirely new mount, and pivot_root() to the new union
+mount root. Attempting to union mount the root file system later in
+boot will result in covering other file systems, e.g., /proc, which
+isn't permitted in the current code and is a bad idea anyway.
+
+Hard read-only file systems
+---------------------------
+
+Union mounts require the lower layer of the file system to be
+read-only. However, in Linux, any individual file system may be
+mounted at multiple places in the namespace, and a file system can be
+changed from read-only to read-write while still mounted. Thus, simply
+checking that the bottom layer is read-only at the time the writable
+overlay is mounted over it is pointless, since at any time the bottom
+layer may become read-write.
+
+We have to guarantee that a file system will be read-only for as long
+as it is the bottom layer of a union mount. To do this, we track the
+number of hard read-only users of a file system in its VFS superblock
+structure. When we union mount a writable overlay over a file system,
+we increment its read-only user count. The file system can only be
+mounted read-write if its read-only users count is zero.
+
+Todo:
+
+- Support hard read-only NFS mounts. See discussion here:
+
+  http://markmail.org/message/3mkgnvo4pswxd7lp
+
+Pathname lookup
+---------------
+
+Pathname lookup in a unioned directory traverses down the union stack
+for the parent directory, looking up each pathname element in each
+layer of the file system (according to the rules of whiteouts,
+fallthrus, and opaque flags). At mount time, the union stack for the
+root directory of the file system is created, and the union stack
+creation for every other unioned directory in the file system is
+boot-strapped using the already-existing union stack of the
+directory's parent. In order to simplify the code greatly, every
+visible directory on the lower file system is required to have a
+matching directory on the upper file system. This matching directory
+is created during pathname lookup if it does not already exist.
+Therefore, each unioned directory is the child of another unioned
+directory (or is the root directory of the file system).
+
+As a high-level example, consider lookup of the lower layer file
+"/mnt/union/lower_subdir/lower_file" in the union of /dev/lower and
+/dev/upper, starting with the <mnt,dentry> pair for the root
+directory of the union mount.
+
+First, we look up "lower_subdir" in the parent directory, "/".
+Since this is the root directory for the mount, it already has a union
+stack constructed, consisting of one struct union_dir in the union
+hash table, filled out with:
+
+um->u_upper = <upper mnt,dentry for "/">
+um->u_lower = <lower mnt,dentry for "/">
+
+Using union_down_one(), we traverse the union stack for "/", looking
+up "lower_subdir" in the "/" directory for /dev/upper, and then in
+/dev/lower. "lower_subdir" only exists in the lower layer, so we
+create a matching directory in the upper layer, and then allocate and
+fill out a union_dir struct that maps these directories to each other:
+
+um->u_upper = <upper mnt,dentry for "lower_subdir">
+um->u_lower = <lower mnt,dentry for "lower_subdir">
+
+Now lookup proceeds with the <upper mnt,dentry> for "lower_subdir" and
+the pathname element "lower_file". We look up "lower_file" in the
+upper layer directory, finding no match. Since this is a unioned
+directory, we call union_down_one() on the <upper mnt,dentry for
+"lower_subdir">, which looks up the union_dir structure we just
+created and returns the <lower mnt,dentry> pair. We then look up
+"lower_file" in the lower layer directory, which succeeds. Unlike
+directories, files are not copied up at lookup time, so pathname
+lookup for "/mnt/union/lower_subdir/lower_file" is now complete with
+the final struct path of <lower mnt,dentry for "lower_file">.
+
+At a finer level of detail, the actual union lookup function is called
+in the following code paths:
+
+do_lookup()->do_union_lookup()->lookup_union()->__lookup_union()
+lookup_hash()->lookup_union()->__lookup_union()
+
+__lookup_union() is where the rules of whiteouts, fallthrus, and
+opaque flags are actually implemented. __lookup_union() returns
+either the first visible dentry, or a negative dentry from the topmost
+file system if no matching dentry exists. If it finds a directory, it
+looks up any potential matching lower layer directories.
+If it finds a lower layer directory, it calls append_to_union() on the
+pair of directories. append_to_union() looks up the upper path in the
+union cache and if no union cache entry already exists, it creates one.
+
+Note that not all directories in a union mount are unioned, only those
+with matching directories on the lower layer. The macro
+IS_UNIONED_DIR() is a cheap, constant time way to check if a directory
+is unioned, while IS_MNT_UNION() checks if the entire mount is unioned
+(and therefore whether the directory in question is potentially
+unioned).
+
+Currently, lookup of a negative dentry in a unioned directory requires
+a lookup in every directory in the union stack every time it is looked
+up. We could avoid subsequent lookups by adding a negative union
+cache entry, exactly the way negative dentries are cached.
+
+File copyup
+-----------
+
+Any system call that alters the data or metadata of a file on the
+bottom layer, or creates or changes a hard link to it, will trigger a
+copyup of the target file from the lower layer to the topmost layer:
+
+ - open(O_WRONLY | O_RDWR | O_APPEND | O_DIRECT)
+ - truncate()/open(O_TRUNC)
+ - link()
+ - rename()
+ - chmod()
+ - chown()/lchown()
+ - utimes()
+ - setxattr()/lsetxattr()
+
+Copyup of a file due to open(O_WRONLY) has already occurred when:
+
+ - write()
+ - ftruncate()
+ - writable mmap()
+
+The following system calls will fail on an fd opened O_RDONLY:
+
+ - fchmod()
+ - fchown()
+ - fsetxattr()
+ - futimens()
+
+Contrary to common sense, the above system calls are defined to
+succeed on O_RDONLY fds. The idea seems to be that the
+O_RDONLY/O_RDWR/O_WRONLY flags only apply to the actual file data, not
+to any form of metadata (times, owner, mode, or even extended
+attributes). Applications making these system calls on O_RDONLY fds
+are correct according to the standard and work on non-union-mounts.
+They will need to be rewritten (O_RDONLY -> O_RDWR) to work on union
+mounts.
+We suspect this usage is uncommon.
+
+This deviation from the standard is due to technical limitations of the
+union mount implementation. Specifically, we would need to replace an
+open file descriptor from the lower layer with an open file descriptor
+for a file with matching pathname and contents on the upper layer,
+which is difficult to do. We avoid this in other system calls by
+doing the copyup before the file is opened. Unionfs doesn't encounter
+this problem because it creates a dummy file struct which redirects or
+fans out operations to the struct files for the underlying file
+systems.
+
+From an application's point of view, the result of an in-kernel file
+copyup is the logical equivalent of another application updating the
+file via the rename() pattern: creat() a new file, copy the data over,
+make changes to the copy, and rename() over the old version. Any
+existing open file descriptors for that file (including those in the
+same application) refer to a now invisible object that used to have
+the same pathname. Only opens that occur after the copyup will see
+updates to the file.
+
+Permission checks
+-----------------
+
+We want to be sure we have the correct permissions to actually succeed
+in a system call before copying a file up to avoid unnecessary IO. At
+present, the permission check for a single system call may be spread
+out over many hundreds of lines of code (e.g., open()). In order to
+check permissions, we occasionally need to determine if there is a
+writable overlay on top of this inode. This requires a full path, but
+often we only have the inode at this point. In particular,
+inode_permission() returns EROFS if the inode is on a read-only file
+system, which is the wrong answer if there is a writable overlay
+mounted on top of it.
+
+Another trouble-maker is may_open(), which both checks permissions for
+open AND truncates the file if O_TRUNC is specified.
+It doesn't make any sense to copy up the file and then let may_open()
+truncate it, but we can't copy it after may_open() truncates it either.
+The current ugly hack is to pass the full nameidata to may_open() and
+copyup inside may_open().
+
+Some solutions:
+
+- Create __inode_permission() and pass it a flag telling it whether or
+  not to check for a read-only fs. Create union_permission() which
+  takes a path, checks for a union mount, and sets the rofs flag.
+  Place the file copyup call after all the permission checks are
+  completed. Push down the full path into the functions that need it
+  and currently only take the dentry or inode.
+
+- For each instance in which we might want to copyup, move permission
+  checks into a new function and call it from a level at which we
+  still have the full path. Pass it an "ignore read-only fs" flag if
+  the file is on a union mount. Pass around the ignore-rofs flag
+  inside the function doing permission checks. If all the permission
+  checks complete successfully, copyup the file. Would require moving
+  truncate out of may_open().
+
+Todo:
+ - On truncate, only copy up the N bytes of file data requested
+ - Make sure above handles truncate beyond EOF correctly
+ - File copyup on chown()/chmod()/chattr() etc.
+ - File copyup on open(O_APPEND)
+ - File copyup on open(O_DIRECT)
+
+Impact on non-union kernels and mounts
+--------------------------------------
+
+Union-related data structures, extra fields, and function calls are
+#ifdef'd out at the function/macro level with CONFIG_UNION_MOUNT in
+nearly all cases (see include/linux/union.h).
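The general shape of compiling a feature out at the macro level can be sketched as follows. This is an illustration of the technique only, under stated assumptions, and not the contents of include/linux/union.h: TOY_MNT_UNION, struct toy_vfsmount, TOY_IS_MNT_UNION() and toy_lookup_takes_union_path() are hypothetical names, and the flag value is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up flag bit standing in for a per-mount "this is a union" flag. */
#define TOY_MNT_UNION 0x01

/* Stand-in for the kernel's struct vfsmount; only the flags are modeled. */
struct toy_vfsmount {
	int mnt_flags;
};

#ifdef CONFIG_UNION_MOUNT
/* Union mounts compiled in: test the (hypothetical) per-mount flag. */
#define TOY_IS_MNT_UNION(mnt)	(((mnt)->mnt_flags & TOY_MNT_UNION) != 0)
#else
/* Compiled out: a constant 0 lets the compiler discard the union branch. */
#define TOY_IS_MNT_UNION(mnt)	((void)(mnt), 0)
#endif

/*
 * Hypothetical caller: returns 1 if a lookup on this mount would take
 * the union-aware path, 0 if it stays on the ordinary path.
 */
int toy_lookup_takes_union_path(const struct toy_vfsmount *mnt)
{
	assert(mnt != NULL);

	if (TOY_IS_MNT_UNION(mnt))
		return 1;	/* union-aware lookup would go here */
	return 0;		/* ordinary lookup, no union overhead */
}
```

Built without -DCONFIG_UNION_MOUNT, the function returns 0 even for a mount with the flag set, and the caller needs no #ifdef of its own because the conditional lives in the macro.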
+
+Todo:
+
+ - Do performance tests
+
+Locking strategy
+================
+
+The current union mount locking strategy is based on the following
+rules:
+
+* Exactly two file systems are unioned
+* The bottom file system is always read-only
+* The top file system is always read-write
+  => A file system can never be a top and a bottom layer at the same time
+
+Additionally, the top layer may only be mounted exactly once. Don't
+think of the top layer as a separate independent file system; when it
+is part of a union mount, it is only a file system in conjunction with
+the read-only bottom layer. The read-only bottom layer is an
+independent file system in and of itself and can be mounted elsewhere,
+including as the bottom layer for another union mount.
+
+Thus, we may define a stable locking order in terms of top layer and
+bottom layer locks, since a top layer is never a bottom layer and a
+bottom layer is never a top layer. Another simplifying assumption is
+that all directories in a pathname exist on the top layer, as they are
+created step-by-step during lookup. This prevents us from ever having
+to walk backwards up the path creating directory entries, which can
+get complicated. By implication, parent directory paths during any
+operation (rename(), unlink(), etc.) are from the top layer. Dentries
+for directories from the bottom layer are only ever seen or used by
+the lookup code.
+
+The two major problems we avoid with the above rules are:
+
+Lock ordering: Imagine two union stacks with the same two file
+systems: A mounted over B, and B mounted over A. Sometimes locks on
+objects in both A and B will have to be held simultaneously. What
+order should they be acquired in? Simply acquiring them from top to
+bottom will create a lock-ordering problem - one thread acquires lock
+on object from A and then tries for a lock on object from B, while
+another thread grabs the lock on object from B and then waits for the
+lock on object from A.
+Some other lock ordering must be defined.
+
+Movement/change/disappearance of objects on multiple layers: A variety
+of nasty corner cases arise when more than one layer is changing at
+the same time. Changes in the directory topology and their effect on
+inheritance are of special concern. Al Viro's canonical email on the
+subject:
+
+http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html
+
+We don't try to solve any of these cases, just avoid them in the first
+place.
+
+Todo: Prevent top layer from being mounted more than once.
+
+Cross-layer interactions
+------------------------
+
+The VFS code simultaneously holds references to and/or modifies
+objects from both the top and bottom layers in the following cases:
+
+Path lookup:
+
+Grabs i_mutex on bottom layer while holding i_mutex on top layer
+directory inode.
+
+File copyup:
+
+Holds i_mutex on the parent directory from the top layer while copying
+up file from lower layer.
+
+link():
+
+File copyup of target while holding i_mutex on parent directory on top
+layer. Followed by a normal link() operation.
+
+rename():
+
+Holds s_vfs_rename_mutex on the top layer, i_mutex of the source's
+parent dir (top layer), and i_mutex of the target's parent dir (also
+top layer) while looking up and copying the bottom layer target and
+also creating the whiteout.
+
+Notes on rename():
+
+First, renaming of directories returns EXDEV. It's not at all
+reasonable to recursively copy directory trees and userspace has to
+handle this case anyway. An exception is rename() of directories that
+exist only on the topmost layer; this succeeds.
+
+Rename involves three steps on a union mount: (1) copyup of the file
+from the bottom layer, (2) rename of the new top-layer copy to the
+target in the usual manner, (3) creation of a whiteout covering the
+source of the rename.
+
+Directory copyup:
+
+Directory entries are copied up on the first readdir().
+We hold the top layer directory i_mutex throughout and sequentially
+acquire and drop the i_mutex for each lower layer directory.
+
+VFS-fs interface
+================
+
+Read-only layer: No support necessary other than enforcement of really
+really read-only semantics (done by VFS for local file systems).
+
+Writable layer: Must implement two new inode operations:
+
+int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+int (*fallthru) (struct inode *, struct dentry *);
+
+And set the MS_WHITEOUT flag to indicate support of these operations.
+
+Todo:
+
+- Decide what to return in d_ino of struct dirent
+  - As Miklos Szeredi points out, the inode number from the underlying
+    fs is from a different inode "namespace" and doesn't have any
+    useful meaning in the top layer fs.
+- Implement whiteouts and fallthrus in ext3
+- Implement whiteouts and fallthrus in btrfs
+
+Supported file systems
+----------------------
+
+Any file system can be a read-only layer. File systems must
+explicitly support whiteouts and fallthrus in order to be a read-write
+layer. This patch set implements whiteouts for ext2, tmpfs, and
+jffs2. We have tested ext2, tmpfs, and iso9660 as the read-only
+layer.
+
+Todo:
+ - Test corner cases of case-insensitive/oversensitive file systems
+
+NFS interaction
+===============
+
+NFS is currently not supported as either type of layer. NFS as
+read-only layer requires support from the server to honor the
+read-only guarantee needed for the bottom layer. To do this, the
+server needs to revoke access to clients requesting read-only file
+systems if the exported file system is remounted read-write or
+unmounted (during which arbitrary changes can occur). Some recent
+discussion:
+
+http://markmail.org/message/3mkgnvo4pswxd7lp
+
+NFS as the read-write layer would require implementation of the
+->whiteout() and ->fallthru() methods. DT_WHT directory entries are
+theoretically already supported.
+
+Also, technically the requirement for a readdir() cookie that is
+stable across reboots comes only from file systems exported via NFSv2:
+
+http://oss.oracle.com/pipermail/btrfs-devel/2008-January/000463.html
+
+Todo:
+
+- Guarantee really really read-only on NFS exports
+- Implement whiteout()/fallthru() for NFS
+
+Userland support
+================
+
+The mount command must support the "-o union" mount option and pass
+the corresponding MS_UNION flag to the kernel. A util-linux git
+tree with union mount support is here:
+
+git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
+
+File system utilities must support whiteouts and fallthrus. An
+e2fsprogs git tree with union mount support is here:
+
+git://git.kernel.org/pub/scm/fs/ext2/val/e2fsprogs.git
+
+Currently, whiteout directory entries are not returned to userland.
+While the directory type for whiteouts, DT_WHT, has been defined for
+many years, very little userland code handles them. Userland will
+never see fallthru directory entries.
+
+Known non-POSIX behaviors
+-------------------------
+
+- Any writing system call (unlink()/chmod()/etc.) can return ENOSPC or EIO
+- Link count may be wrong for files on bottom layer with > 1 link count
+- Link count on directories will be wrong before readdir() (fixable)
+- File copyup is the logical equivalent of an update via copy +
+  rename(). Any existing open file descriptors will continue to refer
+  to the read-only copy on the bottom layer and will not see any
+  changes that occur after the copy-up.
+- rename() of directory fails with EXDEV
+- inode number in d_ino of struct dirent will be wrong for fallthrus
+- fchmod()/fchown()/futimensat()/fsetattr() fail on O_RDONLY fds
+
+Status
+======
+
+The current union mounts implementation is feature-complete on local
+file systems and passes an extensive union mounts test suite,
+available in the union mounts Usermode Linux-based development kit:
+
+http://valerieaurora.org/union/union_mount_devkit.tar.gz
+
+The whiteout code has had some non-trivial level of review and
+testing, but the majority of the rest of the code has had no external
+review or testing outside the authors' machines.
+
+The latest version is available at:
+
+git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git
+
+Check the union mounts web page for the name of the latest branch:
+
+http://valerieaurora.org/union/
+
+Todo:
+
+- Run more tests (e.g., XFS test suite)
+- Get review from VFS maintainers
+
+Non-features
+------------
+
+Features we do not currently plan to support in union mounts:
+
+Online upgrade: E.g., installing software on a file system NFS
+exported to clients while the clients are still up and running.
+Allowing the read-only bottom layer of a union mount to change
+invalidates our locking strategy.
+
+Recursive copying of directories: E.g., implementing rename() across
+layers for directories. Doing an in-kernel copy of a single file is
+bad enough. Recursively copying a directory is a big no-no.
+
+Read-only top layer: The readdir() strategy fundamentally requires the
+ability to create persistent directory entries on the top layer file
+system (which may be tmpfs). Numerous alternatives (including
+in-kernel or in-application caching) exist and are compatible with
+union mounts with its writing-readdir() implementation disabled.
+Creating a readdir() cookie that is stable across multiple readdir()s
+requires one of:
+
+- Write to stable storage (e.g., fallthru dentries)
+- Non-evictable kernel memory cache (doesn't handle NFS server reboot)
+- Per-application caching by glibc readdir()
+
+Aggregation of multiple read-only file systems: We are beginning to
+see how to implement this but it doesn't currently work.
+
+Often these features are supported by other unioning file systems or
+by other versions of union mounts.
+
+Contributing to union mounts
+============================
+
+The union mounts web page is here:
+
+http://valerieaurora.org/union/
+
+It links to:
+
+ - All git repositories
+ - Documentation
+ - An entire self-contained UML-based dev kit with README, etc.
+
+The best mailing list for discussing union mounts is:
+
+linux-fsdevel@xxxxxxxxxxxxxxx
+
+http://vger.kernel.org/vger-lists.html#linux-fsdevel
+
+Thank you for reading!
--
1.6.3.3