> This is my second trial to ask incorporating aufs into mainline.

Basic Aufs Internal Structure

Superblock/Inode/Dentry/File Objects
----------------------------------------------------------------------
Like an ordinary filesystem, aufs has its own superblock/inode/dentry/file
objects. Each of these objects has a dynamically allocated array that
stores pointers of the same kind to the corresponding objects on the
lower filesystems, one slot per branch.
For example, suppose you build a union with one readwrite branch and one
readonly branch, mounted on /au, /rw and /ro respectively.
- /au = /rw + /ro
- /ro/fileA exists but /rw/fileA does not

The aufs lookup operation finds /ro/fileA and gets the dentry for it.
These pointers are stored in the aufs dentry. The array in the aufs
dentry will be,
- [0] = NULL
- [1] = /ro/fileA

This style of array is essentially the same for the aufs
superblock/inode/dentry/file objects.

Because aufs supports manipulating branches dynamically (add, delete,
change), each of these objects has its own generation. When branches are
changed, the generation in the aufs superblock is incremented. The
generation in any other object is compared with it when that object is
accessed. When the generation in the object is obsolete, aufs refreshes
the internal array.

Superblock
----------------------------------------------------------------------
In addition, the aufs superblock holds data for the policies that select
one among multiple writable branches, for XIB files, for pseudo-links
and for the kobject.
See below for details.
For the policies that support copying down a directory, see policy.txt
too.

Branch and XINO(External Inode Number Translation Table)
----------------------------------------------------------------------
Every branch has its own xino (external inode number translation table)
file. The xino file is created and unlinked by aufs internally. When two
members of a union exist on the same filesystem, they share a single
xino file.
The structure of a xino file is simple: it is just a sequence of aufs
inode numbers, indexed by the lower inode number.
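As a rough illustration of this layout, here is a small userspace model
in Python. It is only a sketch, not aufs code: the slot width of 8 bytes,
the little-endian encoding, the use of 0 for "no number recorded", and
the names xino_write/xino_read are all assumptions made for the example.

```python
import io
import struct

# Assumption: 8-byte little-endian slots. The text only says the table is
# a sequence of aufs inode numbers indexed by the lower inode number.
INO_FMT = "<Q"
INO_SIZE = struct.calcsize(INO_FMT)


def xino_write(xino, lower_ino, aufs_ino):
    """Store aufs_ino in the slot indexed by lower_ino."""
    xino.seek(lower_ino * INO_SIZE)
    xino.write(struct.pack(INO_FMT, aufs_ino))


def xino_read(xino, lower_ino):
    """Return the aufs inode number for lower_ino.

    Returns 0 for a hole or a slot past the end of the table
    (assumption: 0 means that nothing has been recorded).
    """
    xino.seek(lower_ino * INO_SIZE)
    raw = xino.read(INO_SIZE)
    if len(raw) < INO_SIZE:
        return 0
    return struct.unpack(INO_FMT, raw)[0]


# An in-memory stand-in for the xino file of one branch.
xino = io.BytesIO()
xino_write(xino, 5, 42)      # lower inode 5 is known to aufs as inode 42
table_size = len(xino.getvalue())
```

The point of the flat layout is that a translation needs no search: one
seek to lower_ino times the slot width, then one fixed-size read or write.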
In the above sample, assume the inode number of /ro/fileA is i111 and
aufs assigns the inode number i999 to fileA. Then aufs writes 999 as a
4-byte (or 8-byte) value at an offset of 111 * 4 (or 111 * 8) bytes in
the xino file.

A writable branch also has three kinds of "whiteout bases". All of them
exist once the branch is joined to aufs, and their names are doubly
whiteouted so that users never see them in the aufs hierarchy.
1. a regular file to which all whiteouts are linked.
2. a directory to store pseudo-links.
3. a directory to store an "orphaned" file temporarily.

1. Whiteout Base
   When you remove a file on a readonly branch, aufs handles it as a
   logical deletion and creates a whiteout on the upper writable branch
   as a hardlink to this file, in order not to consume an inode on the
   writable branch.
2. Pseudo-link Dir
   See below, Pseudo-link.
3. Step-Parent Dir
   When "fileC" exists on the lower readonly branch only, it is opened
   and then removed together with its parent dir, and the user then
   writes something into it, aufs copies up fileC to this directory,
   because there is no other dir in which to store fileC.
   After the file is created under this dir, it is unlinked.

Because aufs supports manipulating branches dynamically (add, delete,
change), a branch has its own id. When the branch order changes, aufs
finds the new index by searching for the branch id.

Pseudo-link
----------------------------------------------------------------------
Assume "fileA" exists on the lower readonly branch only and is
hardlinked to "fileB" on that branch. When you write something to fileA,
aufs copies it up to the upper writable branch. Additionally, aufs
creates a hardlink under the Pseudo-link Directory of the writable
branch. The inode of a pseudo-link is kept in the aufs super_block as a
simple list. If fileB is read after fileA is unlinked, aufs returns the
file data from the pseudo-link instead of from the lower readonly
branch.
Because the pseudo-link is based upon the inode, it is important to keep
the inode number stable by means of xino (see above).

All the hardlinks under the Pseudo-link Directory of the writable branch
should be restored to their proper locations later. Aufs provides a
utility to do this. The userspace helpers are executed by default when
aufs is remounted or unmounted.

XIB(external inode number bitmap)
----------------------------------------------------------------------
In addition to the per-branch xino file, aufs has an external inode
number bitmap in the superblock object. Like a xino file, it is also a
file. It is a simple bitmap that marks whether an aufs inode number is
in use or not.
To reduce file I/O, aufs keeps a single memory page to cache the xib.

Aufs implements a feature to truncate/refresh both the xino and the xib
in order to reduce the number of disk blocks consumed by these files.

Virtual or Vertical Dir
----------------------------------------------------------------------
In order to support multiple layers (branches), the aufs readdir
operation constructs a virtual dir block in memory. For readdir, aufs
calls vfs_readdir() internally for each dir on the branches, merges
their entries while eliminating the whiteouted ones, and attaches the
result to the file (dir) object. So the file object keeps its entry list
until it is closed. The entry list is rebuilt when the file position is
zero and the list has become old. Aufs makes this decision
automatically.

The dynamically allocated memory block for the entry names has a unit of
512 bytes (by default) and stores the names contiguously (no padding).
Another block, for each entry, is handled by kmem_cache too. While
building the dir blocks, aufs creates a hash list and uses it to judge
whether an entry is whiteouted by an upper branch or is already listed.

Some people may say this can be a security hole or can invite a DoS
attack, since a dir (file object) that has been opened and readdir-ed
once holds its entry list and puts pressure on system memory.
But I'd say it is similar to files under /proc or /sys. The virtual
files there also hold a memory page (generally) while they are open.
When an idea to reduce their memory use is introduced, it will be
applied to aufs too.

Workqueue
----------------------------------------------------------------------
Aufs sometimes requires privileged access to a branch, for instance in a
copy-up/down operation. Suppose a user process is going to change a file
that exists only in the lower readonly branch, and one of its ancestor
directories is not writable by that process. Here aufs copies up the
file together with its ancestors, and this may require privilege to set
their owner/group/mode/etc.
This is a typical case of the application-like character of aufs (see
Introduction).

Aufs uses a workqueue synchronously for this case. It creates its own
workqueue. The workqueue is a kernel thread and has privilege. Aufs
passes it the request to call mkdir or write (for example) and waits for
its completion. This approach simply solves a problem with signal
handlers. If aufs did not use the workqueue and instead changed the
privilege of the process, and the mkdir/write call raised SIGXFSZ or
another signal, then the user process might gain a privilege, or the
generated core file might be owned by the superuser.
But I plan to switch to the new credential approach which will be
introduced in linux-2.6.29.

Aufs also uses the system global workqueue (the "events" kernel thread)
for asynchronous tasks, such as handling inotify, re-creating a whiteout
base, and so on. This is unrelated to privilege.
Most aufs operations try to acquire a rw_semaphore for the aufs
superblock at the beginning, and at the same time wait for the
completion of all queued asynchronous tasks.

Whiteout
----------------------------------------------------------------------
The whiteout in aufs is very similar to the one in Unionfs. It is
represented by its filename.
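To make the filename representation concrete, here is a userspace sketch
in Python of the readdir merge described in the Virtual or Vertical Dir
section. It is a model under stated assumptions, not the aufs
implementation: it assumes the whiteout for "name" is an entry called
".wh.name" (the prefix is not stated in this text), and merge_dirs is a
hypothetical helper that takes plain lists of names, topmost branch
first.

```python
# Assumption: a whiteout for "name" is an entry named ".wh.name".
WH_PREFIX = ".wh."


def merge_dirs(branch_listings):
    """Merge per-branch directory listings, topmost branch first.

    Whiteout entries are never shown, and a whiteout on a branch hides
    the same name on every branch below it.
    """
    seen = set()        # names already listed by an upper branch
    whiteouted = set()  # names hidden by a whiteout on an upper branch
    merged = []
    for names in branch_listings:
        # Whiteouts found on this branch only affect the branches below.
        local_wh = {n[len(WH_PREFIX):] for n in names
                    if n.startswith(WH_PREFIX)}
        for name in names:
            if name.startswith(WH_PREFIX):
                continue                  # the whiteout itself is hidden
            if name in seen or name in whiteouted:
                continue                  # duplicate or logically deleted
            seen.add(name)
            merged.append(name)
        whiteouted |= local_wh
    return merged


rw = [".wh.fileB", "fileC"]           # fileB was removed through the union
ro = ["fileA", "fileB", "fileC"]
result = merge_dirs([rw, ro])
```

In this sample the merged listing contains fileC (from the writable
branch) and fileA, while fileB is hidden by its whiteout. The two sets
stand in for the hash list that aufs builds while constructing the dir
blocks.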
UnionMount uses a file mode instead, but I am afraid that several
utilities (find(1) or something) would have to support it.

Basically the whiteout represents "logical deletion", which stops aufs
from looking up further, but it also represents "dir is opaque", which
stops the lookup as well.

In aufs, rmdir(2) and rename(2) for a dir use whiteouts in another way.
To make the several steps within a single systemcall revertible, aufs
renames a directory to a temporary unique whiteouted name.
For example, in rename(2) of a dir where the target dir already exists,
aufs renames the target dir to a temporary unique whiteouted name before
the actual rename on a branch, and then handles the other actions (make
it opaque, update the attributes, etc). If an error happens in these
actions, aufs simply renames the whiteouted name back and returns an
error. If all of them succeed, aufs registers a function with the system
global workqueue to remove the temporary unique whiteouted name
completely and asynchronously.

Copy-up
----------------------------------------------------------------------
It is a well-known feature or concept.
When a user modifies a file on a readonly branch, aufs performs a
"copy-up" internally and makes the change to the new file on the upper
writable branch.
When the triggering systemcall does not update the timestamps of the
parent dir, aufs reverts them after the copy-up.
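The copy-up behaviour described above, including the restored timestamps
of the parent dir, can be sketched in userspace Python. This is only a
model that uses two ordinary directories in place of branches: copy_up
and the ro/rw layout are hypothetical, and the sketch restores the
timestamps of the parent dir on the writable side as a simplification.

```python
import os
import shutil
import tempfile


def copy_up(ro_root, rw_root, relpath):
    """Copy relpath from the readonly tree to the writable tree.

    The atime/mtime of the destination parent dir are restored after
    the copy, so the copy itself leaves no trace on the parent.
    """
    src = os.path.join(ro_root, relpath)
    dst = os.path.join(rw_root, relpath)
    parent = os.path.dirname(dst)
    os.makedirs(parent, exist_ok=True)    # create ancestors as needed
    before = os.stat(parent)
    shutil.copy2(src, dst)                # copies data and file metadata
    os.utime(parent, ns=(before.st_atime_ns, before.st_mtime_ns))
    return dst


# Two plain directories stand in for the readonly and writable branches.
root = tempfile.mkdtemp()
ro_root = os.path.join(root, "ro")
rw_root = os.path.join(root, "rw")
os.makedirs(os.path.join(ro_root, "dir"))
os.makedirs(os.path.join(rw_root, "dir"))
with open(os.path.join(ro_root, "dir", "fileA"), "w") as f:
    f.write("lower")

# Give the writable parent a known mtime so that the restore is visible.
KNOWN_NS = 10**18
os.utime(os.path.join(rw_root, "dir"), ns=(KNOWN_NS, KNOWN_NS))

copied = copy_up(ro_root, rw_root, os.path.join("dir", "fileA"))
with open(copied, "a") as f:              # the change lands on the copy
    f.write("+upper")
```

After this runs, the file on the readonly side is untouched, the
modified data lives only on the writable side, and the mtime of the
writable parent dir is the value it had before the copy.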