Hi all,

As Al and Christoph have requested, here is the design document for
writable overlays (a.k.a. union mounts). It includes a description of
our locking strategy. Please read and comment!

To go along with this doc, I have rebased our kernel patches against
2.6.31, e2fsprogs against 1.40.9, and util-linux-ng against latest
git. Pointers to all these git repositories and a complete UML-based
union mounts dev kit can be found here:

http://valerieaurora.org/union/

We will post the patches for review soon, but don't let that stop you
from reviewing and testing them now. :)

Thanks to everyone who already sent patches, tested, or reviewed. A
list of everyone who has contributed so far is on the union mounts web
page.

Thanks,

-VAL

State of writable overlays (formerly union mounts)
==================================================

This version of union mounts is renamed "writable overlays." The goal
of this patch set is to support a single read-write file system
overlaid on a single read-only file system. "Union mounts" suggests
that we support unions of arbitrary numbers and types of file systems,
which is not the goal of this patch set.

The most recent version of writable overlays can boot to multi-user
mode with a writable overlay root file system. open(), truncate(),
creat(), unlink(), mkdir(), rmdir(), and rename() work. link(),
chmod(), chown(), and chattr() don't work yet.

This document describes the architecture and current status of
writable overlays, including an item-by-item todo list.

Writable overlays (formerly union mounts)
=========================================

In this document:

- Overview of writable overlays
- Terminology
- VFS implementation
- Locking strategy
- VFS/file system interface
- Userland interface
- NFS interaction
- Status
- Contributing to writable overlays

Overview
========

Writable overlays (formerly known as union mounts) are used to layer a
single writable file system over a single read-only file system, with
all writes going to the writable file system. The namespace of both
file systems appears as a combined whole to userland, with entries on
the writable file system covering up any matching pathnames on the
read-only file system. A few use cases:

- Root file system on CD with writes saved to hard drive (LiveCD)
- Multiple virtual machines with the same starting root file system
- Cluster with NFS mounted root on clients

Most if not all of these problems could be solved with a COW block
device; however, sharing at the file system level has higher
performance and uses less disk space.

What writable overlays are not
------------------------------

Writable overlays are not a general-purpose unioning file system. They
do not provide a generic "union of namespaces" operation for an
arbitrary number of file systems. Many interesting features can be
implemented with a generic unioning facility: unioning of more than
two file systems, dynamic insertion and removal of branches, online
upgrade, etc. Some unioning file systems that do this are UnionFS and
AUFS.

Unfortunately, the complexity of these feature sets leads to difficult
corner cases which so far have been unsolvable in the context of the
Linux VFS. Writable overlays avoid these corner cases by reducing the
feature set to the bare minimum of most-requested features: one
writable file system layered over one read-only file system.

Despite the limitations of writable overlays, the VFS infrastructure
they use is generic enough to be reused by more full-featured unioning
file systems.

Terminology
===========

The main analogy for writable overlays is that a writable file system
is mounted "on top" of a read-only file system. Lookups start at the
"top" read-write file system and travel "down" to the "bottom"
read-only file system only if no blocking entry exists on the top
layer.

Top layer: The read-write file system. Lookups begin here.

Bottom layer: The read-only file system. Lookups end here.

Path: Combination of the vfsmount and dentry structure.

Follow down: Given a path from the top layer, find the corresponding
path on the bottom layer.

Follow up: Given a path from the bottom layer, find the corresponding
path on the top layer.

Whiteout: A directory entry in the top layer that prevents lookups
from travelling down to the bottom layer. Created on unlink()/rmdir()
if a corresponding directory entry exists in the bottom layer.

Opaque: A flag on a directory in the top layer that prevents lookups
of entries in this directory from travelling down to the bottom layer
(unless there is an explicit fallthru entry allowing that for a
particular entry). Set on creation of a directory that replaces a
whiteout, and after a directory copyup.

Fallthru: A directory entry which allows lookups to "fall through" to
the bottom layer for that exact directory entry. This serves as a
placeholder for directory entries from the bottom layer during
readdir(). Fallthrus override opaque flags.

File copyup: Create a file on the top layer that has the same
properties and contents as the file with the same pathname on the
bottom layer.

Directory copyup: Copy up the visible directory entries from the
bottom layer as fallthrus in the matching top layer directory. Mark
the directory opaque to avoid unnecessary negative lookups on the
bottom layer.

Examples
========

What happens when I...

- creat() /newfile -> creates on top layer
- unlink() /oldfile -> creates a whiteout on top layer
- Edit /existingfile -> copies up to top layer at open(O_WR) time
- truncate /existingfile -> copies up to top layer + N bytes if specified
- touch()/chmod()/chown()/etc. -> copies up to top layer
- mkdir() /newdir -> creates on top layer
- rmdir() /olddir -> creates a whiteout on top layer
- mkdir() /olddir after above -> creates on top layer w/ opaque flag
- readdir() /shareddir -> copies up entries from bottom layer as fallthrus
- link() /oldfile /newlink -> copies up /oldfile, creates /newlink on top layer
- symlink() /oldfile /symlink -> nothing special
- rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer
- rename() dir -> EXDEV

Getting to a root file system with a writable overlay:

- Mount the base read-only file system as the root file system
- Mount the read-only file system again on /newroot
- Mount the writable overlay on /newroot:
  # mount -o union /dev/sda /newroot
- pivot_root to /newroot
- Start init

See scripts/pivot.sh in the UML devkit linked to from:

http://valerieaurora.org/union/

VFS implementation
==================

Writable overlays are implemented as an integral part of the VFS,
rather than as a VFS client file system (i.e., a stacked file system
like unionfs or ecryptfs). Implementing writable overlays inside the
VFS eliminates the need for duplicate copies of VFS data structures,
unnecessary indirection, and code duplication, but requires very
maintainable, low-to-zero overhead code. Writable overlays require no
change to file systems serving as the read-only layer, and require
some minor support from file systems serving as the read-write layer.
File systems that want to be the writable layer must implement the new
->whiteout() and ->fallthru() inode operations, which create special
dummy directory entries.
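
To make the role of these dummy entries concrete, here is a small userspace sketch of how a lookup in one directory could be resolved from the definitions in the Terminology section. It is only a model, not the kernel code: the types and the overlay_lookup() function are hypothetical names invented for this example, and the bottom layer is reduced to a single "does the name exist there" flag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical kinds of top-layer directory entry in this toy model. */
enum entry_kind { ENT_NONE, ENT_REGULAR, ENT_WHITEOUT, ENT_FALLTHRU };

enum lookup_result { FOUND_TOP, FOUND_BOTTOM, NOT_FOUND };

struct top_entry {
    const char *name;
    enum entry_kind kind;
};

struct top_dir {
    int opaque;                      /* opaque flag on the top-layer directory */
    const struct top_entry *entries; /* entries present on the top layer */
    size_t nentries;
};

/* Return the kind of the top-layer entry called name, or ENT_NONE. */
static enum entry_kind top_kind(const struct top_dir *dir, const char *name)
{
    for (size_t i = 0; i < dir->nentries; i++)
        if (strcmp(dir->entries[i].name, name) == 0)
            return dir->entries[i].kind;
    return ENT_NONE;
}

/*
 * Resolve name in a unioned directory. on_bottom says whether the
 * bottom layer has an entry with this name.
 */
enum lookup_result overlay_lookup(const struct top_dir *dir, const char *name,
                                  int on_bottom)
{
    assert(dir != NULL && name != NULL);

    switch (top_kind(dir, name)) {
    case ENT_REGULAR:
        return FOUND_TOP;       /* top layer covers the bottom layer */
    case ENT_WHITEOUT:
        return NOT_FOUND;       /* lookup may not travel down */
    case ENT_FALLTHRU:
        /* fallthru overrides the opaque flag for this one name */
        return on_bottom ? FOUND_BOTTOM : NOT_FOUND;
    case ENT_NONE:
    default:
        if (dir->opaque)
            return NOT_FOUND;   /* opaque directory blocks the bottom layer */
        return on_bottom ? FOUND_BOTTOM : NOT_FOUND;
    }
}
```

In this model a directory that has been copied up (opaque, with one fallthru per visible bottom-layer entry) answers every lookup from the top layer's own entries, which is the point of marking it opaque.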

union_mount structure
---------------------

The primary data structure for writable overlays is the union_mount
structure, which connects overlapping directory dentries into a "union
stack":

struct union_mount {
        atomic_t u_count;               /* reference count */
        struct mutex u_mutex;
        struct list_head u_unions;      /* list head for d_unions */
        struct list_head u_list;        /* list head for mnt_unions */
        struct hlist_node u_hash;       /* list head for searching */
        struct hlist_node u_rhash;      /* list head for reverse searching */

        struct path u_this;             /* this is me */
        struct path u_next;             /* this is what I overlay */
};

The union_mount is referenced from the corresponding directory's
dentry:

struct dentry {
[...]
#ifdef CONFIG_UNION_MOUNT
        /*
         * The following fields are used by the VFS based union mount
         * implementation. Both are protected by union_lock!
         */
        struct list_head d_unions;      /* list of union_mounts */
        unsigned int d_unionized;       /* unions referencing this dentry */
#endif
[...]
};

Each top layer directory with the potential for a lookup to fall
through to the bottom layer has a union_mount structure stored in a
union_mount hash table. The union_mount's can be looked up both by the
top layer's path (via union_lookup()) and the bottom layer's path (via
union_rlookup()). Once you have the path (vfsmount and dentry pair) of
a file, the union stack can be followed down, layer by layer, with
follow_union_down(), and up with follow_union_mount().

All union_mount's are allocated from a kmem cache when the
corresponding dentries are created. union_mount's are allocated when
the first referencing dentry is allocated and freed when all of the
referencing dentries are freed - that is, the dcache drives the union
cache. While writable overlays only use two layers, the union stack
infrastructure is capable of supporting an arbitrary number of file
system layers (leaving aside locking issues).
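
The forward and reverse lookups described above can be illustrated with a userspace toy. This is a sketch under loud assumptions: struct toy_path stands in for struct path (two integers instead of vfsmount and dentry pointers), a small array with linear search replaces the two hash tables, and all toy_* names are invented for this example rather than taken from the patches.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct path: ids instead of vfsmount/dentry pointers. */
struct toy_path {
    int mnt;
    int dentry;
};

/* Toy stand-in for struct union_mount: one link in a union stack. */
struct toy_union {
    struct toy_path this_path;  /* the covering (top) directory */
    struct toy_path next;       /* the directory it overlays */
};

#define MAX_UNIONS 8

/* Linear table; the design above uses hash tables keyed both ways. */
struct toy_union_table {
    struct toy_union u[MAX_UNIONS];
    size_t n;
};

static int same_path(const struct toy_path *a, const struct toy_path *b)
{
    return a->mnt == b->mnt && a->dentry == b->dentry;
}

/* Record that top overlays bottom. Returns 0, or -1 if the table is full. */
int toy_union_add(struct toy_union_table *t, struct toy_path top,
                  struct toy_path bottom)
{
    assert(t != NULL);
    if (t->n == MAX_UNIONS)
        return -1;
    t->u[t->n].this_path = top;
    t->u[t->n].next = bottom;
    t->n++;
    return 0;
}

/* Forward lookup: given a top-layer path, return what it overlays. */
const struct toy_path *toy_follow_down(const struct toy_union_table *t,
                                       const struct toy_path *top)
{
    for (size_t i = 0; i < t->n; i++)
        if (same_path(&t->u[i].this_path, top))
            return &t->u[i].next;
    return NULL;
}

/* Reverse lookup: given a bottom-layer path, return the first path covering it. */
const struct toy_path *toy_follow_up(const struct toy_union_table *t,
                                     const struct toy_path *bottom)
{
    for (size_t i = 0; i < t->n; i++)
        if (same_path(&t->u[i].next, bottom))
            return &t->u[i].this_path;
    return NULL;
}
```

Calling toy_follow_down() repeatedly on its own result would walk a deeper stack layer by layer, which is how the same structure could in principle describe more than two layers.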

Todo:

- Rename union_mount structure - it's per directory, not per mount

Code paths
----------

Writable overlays modify the following key code paths in the VFS:

- mount()/umount()
- Path lookup
- Any path that modifies an existing file

Mount
-----

Writable overlays are created in two steps:

1. Mount the bottom layer file system read-only in the usual manner.
2. Mount the top layer with the "-o union" option at the same
   mountpoint.

The bottom layer must be read-only and the top layer must be
read-write and support whiteouts and fallthrus (indicated by setting
the MS_WHITEOUT flag). Currently, the top layer is forced to "noatime"
to avoid a copyup on every access of a file. Supporting atime with the
current infrastructure would require a copyup on every open().

Currently, the top layer covers all submounts on the read-only file
system. This can be inconvenient; e.g., mounting a writable overlay on
the root file system after procfs has been mounted. It's not clear
what the right behavior is. Also, it may be smarter to mount both
read-only and read-write layers in one step, but the mount options get
pretty ugly.

pivot_root() is supported and is the recommended way to get to a root
file system with a writable overlay.

Todo:

- Rename "-o union" mount option - "overlay"?
- Don't permit mounting over read-write submounts
- Choose submount covering behavior
- Allow atime?

Really really read-only file systems:

In Linux, any individual file system may be mounted at multiple places
in the namespace. The file system may change from read-only to
read-write while still mounted. Thus, simply checking that the bottom
layer is read-only at the time the writable overlay is mounted over it
is pointless, since at any time the bottom layer may become
read-write.

We need to guarantee that a file system will be read-only for as long
as it is the bottom layer of a writable overlay. To do this, we track
the number of "read-only users" of a file system in its VFS superblock
structure.
When we mount a writable overlay over a file system, we increment its
read-only user count. The file system can only be mounted read-write
if its read-only users count is zero.

Todo:

- Support really really read-only NFS mounts. See discussion here:

  http://markmail.org/message/3mkgnvo4pswxd7lp

Path lookup
-----------

Much of the action in writable overlays happens during lookup().
First, if we look up a directory on the bottom layer that doesn't yet
exist on the top layer, __link_path_walk() always creates a matching
directory on the top layer. This way, we never have to walk back up a
path, creating directories as we go, before we can copyup a file.
Second, if we need to copy up a file, we first (re)look it up with the
LOOKUP_TOPMOST flag, which instructs __link_path_walk() to create it
on the top layer. Neither directory entries nor file data are copied
up in __link_path_walk() - that happens after the lookup, in the
caller.

The main cut-out to writable overlay code is in do_lookup():

static int do_lookup(struct nameidata *nd, struct qstr *name,
                     struct path *path)
{
        int err;

        if (IS_MNT_UNION(nd->path.mnt))
                goto need_union_lookup;
[...]
need_union_lookup:
        err = cache_lookup_union(nd, name, path);
        if (!err && path->dentry)
                goto done;

        err = real_lookup_union(nd, name, path);
        if (err)
                goto fail;
        goto done;

cache_lookup_union() looks for the dentry in the dcache, starting at
the top layer and following down. If it finds nothing, it returns a
negative dentry from the top layer. If it finds a directory, it looks
for the same directory in the bottom layer; if that exists, it
allocates a union_mount struct and hangs the bottom layer dentry off
of it. real_lookup_union() does the same for uncached entries.
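
The first rule above (directories missing on the top layer are created there as the walk proceeds) can be sketched in userspace as follows. This is an illustrative model only, not the __link_path_walk() logic: each layer is a hypothetical flat set of absolute directory paths, whiteouts and opaque flags are ignored, paths must not be "/" or end in a slash, and walk_and_mirror() is a name invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_DIRS 16
#define MAX_PATH_LEN 64

/* Hypothetical toy layer: a flat set of absolute directory paths. */
struct layer {
    char dirs[MAX_DIRS][MAX_PATH_LEN];
    int ndirs;
};

/* Does the layer contain the directory named by the first len bytes of path? */
static int layer_has(const struct layer *l, const char *path, size_t len)
{
    for (int i = 0; i < l->ndirs; i++)
        if (strlen(l->dirs[i]) == len && memcmp(l->dirs[i], path, len) == 0)
            return 1;
    return 0;
}

/* Add the first len bytes of path as a directory. Returns 0, or -1 if no room. */
static int layer_add(struct layer *l, const char *path, size_t len)
{
    if (l->ndirs == MAX_DIRS || len >= MAX_PATH_LEN)
        return -1;
    memcpy(l->dirs[l->ndirs], path, len);
    l->dirs[l->ndirs][len] = '\0';
    l->ndirs++;
    return 0;
}

/*
 * Walk an absolute directory path one component at a time. Any
 * directory that exists only on the bottom layer is created on the
 * top layer before the walk continues, so the walk never has to go
 * back up the path later. Returns the number of directories created
 * on the top layer, or -1 if a component exists on neither layer or
 * the toy table is full.
 */
int walk_and_mirror(struct layer *top, const struct layer *bottom,
                    const char *path)
{
    int created = 0;
    size_t len = strlen(path);

    assert(path[0] == '/');

    for (size_t i = 1; i <= len; i++) {
        if (path[i] != '/' && path[i] != '\0')
            continue;
        /* path[0..i) is the prefix ending at the current component */
        if (layer_has(top, path, i))
            continue;
        if (!layer_has(bottom, path, i))
            return -1;
        if (layer_add(top, path, i) < 0)
            return -1;
        created++;
    }
    return created;
}
```

After one walk, every directory on the path exists on the top layer, so a later copyup of a file in the final directory only has to create the file itself.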

Todo:

- Reorganize cache/hash/real lookup code - lots of code duplication
- Turn create-on-topmost test into #ifdef'able function
- Rewrite with assumption that topmost directory always exists
- Remove duplicated tests and other duplicated code

File copyup
-----------

Any system call that alters an existing file on the bottom layer
(including creating or moving a hard link to it) will trigger a copyup
of the target file to the top layer (via union_copyup() or
__union_copyup()). This includes:

- open(O_WRONLY | O_RDWR | O_APPEND | O_DIRECT)
- truncate()/ftruncate()/open(O_TRUNC)
- link()
- rename()
- chmod()
- chattr()

Copyup of a file DOES NOT occur on:

- open(O_RDONLY) if noatime
- stat() if noatime
- creat()/mkdir()/mknod()
- symlink()
- unlink()/rmdir()

From an application's point of view, the result of an in-kernel file
copyup is the logical equivalent of another application updating the
file via the rename() pattern: creat() a new file, copy the data over,
make changes to the copy, and rename() over the old version. Any
existing open file descriptors for that file (including those in the
same application) refer to a now invisible and unreferenced object
that used to have the same pathname. Only opens that occur after the
copyup will see updates to the file.

Todo:

- copyup on chown()/chmod()/chattr()
- copyup if atime is enabled?

Permission checks
-----------------

We want to be sure we have the correct permissions to actually succeed
in a system call before copying a file up to avoid unnecessary IO. At
present, the permission check for a single system call may be spread
out over many hundreds of lines of code (e.g., open()). In order to
check permissions, we occasionally need to determine if there is a
writable overlay on top of this inode. This requires a full path, but
often we only have the inode at this point.
In particular, inode_permission() returns EROFS if the inode is on a
read-only file system, which is the wrong answer if there is a
writable overlay mounted on top of it.

Another trouble-maker is may_open(), which both checks permissions for
open AND truncates the file if O_TRUNC is specified. It doesn't make
any sense to copy up the file and then let may_open() truncate it, but
we can't copy it after may_open() truncates it either. The current
ugly hack is to pass the full nameidata to may_open() and copyup
inside may_open().

Some solutions:

- Create __inode_permission() and pass it a flag telling it whether or
  not to check for a read-only fs. Create union_permission() which
  takes a path, checks for a union mount, and sets the rofs flag.
  Place the file copyup call after all the permission checks are
  completed. Push down the full path into the functions that need it
  and currently only take the dentry or inode.

- For each instance in which we might want to copyup, move permission
  checks into a new function and call it from a level at which we
  still have the full path. Pass it an "ignore read-only fs" flag if
  the file is on a union mount. Pass around the ignore-rofs flag
  inside the function doing permission checks. If all the permission
  checks complete successfully, copyup the file. Would require moving
  truncate out of may_open().

Todo:

- On truncate, only copy up the N bytes of file data requested
- Make sure above handles truncate beyond EOF correctly
- File copyup on chown()/chmod()/chattr() etc.
- File copyup on open(O_APPEND)
- File copyup on open(O_DIRECT)

Impact on non-union kernels and mounts
--------------------------------------

Union-related data structures, extra fields, and function calls are
#ifdef'd out at the function/macro level with CONFIG_UNION_MOUNT in
nearly all cases (see include/linux/union.h). The union-specific code
in the cache lookup path is out of line.

Currently, is_unionized() is pretty heavy-weight: it walks up the
mount hierarchy, grabbing the vfsmount lock at each level. It may be
possible to simplify this greatly if a writable layer can only cover
exactly one mount, rather than a tree of mounts.

Todo:

- Turn copyup in __link_path_walk() into #ifdef'd function
- Do performance tests
- Optimize is_unionized()
- Properly #ifdef out mount path code

Locking strategy
================

The current writable overlay locking strategy is based on the
following rules:

* Exactly two file systems are unioned
* The bottom file system is always read-only
* The top file system is always read-write
  => A file system can never be a top and a bottom layer at the same
  time

Additionally, the top layer (the writable overlay) may only be mounted
exactly once. Don't think of the writable overlay as a separate
independent file system; when it is mounted as a writable overlay, it
is only a file system in conjunction with the read-only bottom layer.
The read-only bottom layer is an independent file system in and of
itself and can be mounted elsewhere, including as the bottom layer for
another writable overlay.

Thus, we may define a stable locking order in terms of top layer and
bottom layer locks, since a top layer is never a bottom layer and a
bottom layer is never a top layer. Objects from the bottom layer are
never changed (so don't need write locks) and only require atomic
operations to manage kernel data structures (ref counts, etc.).

Another simplifying assumption is that all directories in a pathname
exist on the top layer, as they are created step-by-step during
lookup. This prevents us from ever having to walk backwards up the
path creating directory entries, which can get complicated especially
when you consider the need to prevent topology changes. By
implication, parent directories during any operation (rename(),
unlink(), etc.) are from the top layer.
Dentries for directories from the bottom layer are only ever used by
lookup code.

The two major problems we avoid with the above rules are:

Lock ordering: Imagine two union stacks with the same two file
systems: A mounted over B, and B mounted over A. Sometimes locks on
objects in both A and B will have to be held simultaneously. What
order should they be acquired in? Simply acquiring them from top to
bottom will create a lock-ordering problem - one thread acquires lock
on object from A and then tries for a lock on object from B, while
another thread grabs the lock on object from B and then waits for the
lock on object from A. Some other lock ordering must be defined.

Movement/change/disappearance of objects on multiple layers: A variety
of nasty corner cases arise when more than one layer is changing at
the same time. Changes in the directory topology and their effect on
inheritance are of special concern. Al Viro's canonical email on the
subject:

http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html

We don't try to solve any of these cases, just avoid them in the first
place.

Todo: Prevent top layer from being mounted more than once.

Cross-layer interactions
------------------------

The VFS code simultaneously holds references to and/or modifies
objects from both the top and bottom layers in the following cases:

Path lookup:

Holds i_mutex on top layer directory inode while doing lookups on
bottom layer. Grabs i_mutex on bottom layer off and on.

Todo:

- Is i_mutex on lower directory necessary?

File copyup in general:

File copyup occurs while holding i_mutex on the parent directory of
the top layer. As noted before, an in-kernel file copyup is the
logical equivalent of a userspace rename() of an identical file on to
this pathname.

link():

File copyup of target while holding i_mutex on parent directory on top
layer. Followed by a normal link() operation.

rename():

First, renaming of directories returns EXDEV.
It's not at all reasonable to recursively copy directory trees and
userspace has to handle this case anyway.

Rename involves two operations on a writable overlay: (1) creation of
a whiteout covering the source of the rename, (2) a copyup of the file
from the bottom layer. The file copyup does not need to happen
atomically, only the whiteout and the new link to the file. I propose
that we copyup the source file to the "old" name (rather than directly
to the "new" name), and then perform the normal file system rename
operation. The only addition is creation of whiteout for the old name.

The current rename() implementation is just a hack to get things
working and doesn't work at all as described above.

Lock order: The file copyup happens before the rename() lock. When we
create the whiteout, we will already have the directory i_mutex.
Otherwise, locking as usual.

Directory copyup:

Directory entries are copied up on the first readdir(). We hold the
top layer directory i_mutex throughout. A fallthru is created for each
entry that appears only on the lower layer.

Current patch takes the i_mutex on the bottom layer directory, which
doesn't seem to be necessary.

VFS-fs interface
================

Read-only layer: No support necessary other than enforcement of really
really read-only semantics (done by VFS for local file systems).

Writable layer: Must implement two new inode operations:

int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
int (*fallthru) (struct inode *, struct dentry *);

And set the MS_WHITEOUT flag.

Whiteouts and fallthrus are most similar to symlinks, since they
redirect to an object possibly located in another file system without
keeping a reference on it.

Todo:

- Return correct inode number in d_ino member of struct dirent by one
  of:
  - Save inode number of target in fallthru entry itself
  - Lookup inode number during readdir()
- Try re-implementing ext2 as special symlinks - may be much simpler
- Implement ext3 (also as symlinks?)
- Implement btrfs

Supported file systems
----------------------

Any file system can be a read-only layer. File systems must explicitly
support whiteouts and fallthrus in order to be a read-write layer.
This patch set implements whiteouts for ext2, tmpfs, and jffs2. We
have tested ext2, tmpfs, and iso9660 as the read-only layer.

Todo:

- Test corner cases of case-insensitive/oversensitive file systems

NFS interaction
===============

NFS is currently not supported as either type of layer.

NFS as read-only layer requires support from the server to honor the
read-only guarantee needed for the bottom layer. To do this, the
server needs to revoke access to clients requesting read-only file
systems if the exported file system is remounted read-write or
unmounted (during which arbitrary changes can occur). Some recent
discussion:

http://markmail.org/message/3mkgnvo4pswxd7lp

NFS as the read-write layer would require implementation of the
->whiteout() and ->fallthru() methods. DT_WHT directory entries are
theoretically already supported.

Also, technically the requirement for a readdir() cookie that is
stable across reboots comes only from file systems exported via NFSv2:

http://oss.oracle.com/pipermail/btrfs-devel/2008-January/000463.html

Todo:

- Implement whiteout()/fallthru() for NFS
- Guarantee really really read-only on NFS exports

Userland support
================

The mount command must support the "-o union" mount option and pass
the corresponding MS_UNION flag to the kernel. A util-linux git tree
with writable overlay support is here:

git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git

File system utilities must support whiteouts and fallthrus. An
e2fsprogs git tree with writable overlay support is here:

git://git.kernel.org/pub/scm/fs/ext2/val/e2fsprogs.git

Currently, whiteout directory entries are not returned to userland.
While the directory type for whiteouts, DT_WHT, has been defined for
many years, very little userland code handles them.
Userland will never see fallthru directory entries.

Known non-POSIX behaviors
-------------------------

- Any writing system call (unlink()/chmod()/etc.) can return ENOSPC or
  EIO
- Link count may be wrong for files on bottom layer with > 1 link
  count
- Link count on directories will be wrong before readdir() (fixable)
- File copyup is the logical equivalent of an update via copy +
  rename(). Any existing open file descriptors will continue to refer
  to the read-only copy on the bottom layer and will not see any
  changes that occur after the copy-up.
- rename() of directory fails with EXDEV

Status
======

The current writable overlays patch set varies between RFC/prototype
and pretty stable, depending on the particular patch. The current
patch set boots to multi-user mode with a writable overlay root file
system (albeit with some complaints). Some parts of the code were
written years ago and have been reviewed, rewritten and tested many
times. Other parts were written last month and need review, rewriting,
and testing. The commit messages note the state of each patch.

The current patch set is against 2.6.31. You can find it here, in the
branch "overlay":

git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git

Non-features
------------

Features we do not currently plan to support as part of writable
overlays:

Online upgrade: E.g., installing software on a file system NFS
exported to clients while the clients are still up and running.
Allowing the read-only bottom layer to change while the writable
overlay file system is mounted invalidates our locking strategy.

Recursive copying of directories: E.g., implementing rename() across
layers for directories. Doing an in-kernel copy of a single file is
bad enough. Recursively copying a directory is a big no-no.

Read-only top layer: The readdir() strategy fundamentally requires the
ability to create persistent directory entries on the top layer file
system (which may be tmpfs).
Numerous alternatives (including in-kernel or in-application caching)
exist and are compatible with writable overlays with its
writing-readdir() implementation disabled. Creating a readdir() cookie
that is stable across multiple readdir()s requires one of:

- Write to stable storage (e.g., fallthru dentries)
- Non-evictable kernel memory cache (doesn't handle NFS server reboot)
- Per-application caching by glibc readdir()

Aggregation of multiple read-only file systems: While perfectly
reasonable from a user perspective, we just aren't smart enough to
figure out the locking problems from a kernel perspective. Sorry!

Often these features are supported by other unioning file systems or
by other versions of union mounts.

Contributing to writable overlays
=================================

The writable overlays web page is here:

http://valerieaurora.org/union/

It links to:

- All git repositories
- Documentation
- An entire self-contained UML-based dev kit with README, etc.

The mailing list for discussing writable overlays is:

linux-fsdevel@xxxxxxxxxxxxxxx
http://vger.kernel.org/vger-lists.html#linux-fsdevel

Thank you for reading!