> Here are some of them and the intention of this post is to get some
> initial feedback about its design.
:::
> Kindly review and let me know your comments.

o readdir -- virtual dir block on memory (VDIR)
----------------------------------------------------------------------
This is an approach I posted a few months ago in reply to UnionMount's
post. It constructs a virtual dir block in memory. For readdir, aufs
calls vfs_readdir() internally for each lower dir, merges their entries
while eliminating the whiteout-ed ones, and hands the result to the
file (dir) object. So the file object keeps its entry list until it is
closed. The entry list is rebuilt when the file position is zero and
the list has become old. aufs makes this decision automatically.

It may consume a rather large amount of memory and cpu cycles. To
reduce the number of memory allocations, the implementation became
rather tricky.
Some people may call it a security hole or a DoS attack vector, since
an opened and once readdir-ed dir (file object) holds its entry list
and puts pressure on system memory. But I'd say it is similar to files
under /proc or /sys. The virtual files on procfs and sysfs also hold a
memory page (generally) while they are opened. When an idea to reduce
memory for them is introduced, it will be applied to aufs too.

o policies for selecting one among multiple writable branches,
  parent-dir, round-robin and most-free-space
----------------------------------------------------------------------
When there is more than one writable branch, aufs has to decide the
target branch for file creation or copy-up. By default, the highest
writable branch which has the parent (or ancestor) dir of the target
file is chosen (top-down-parent policy).
At the user's request, aufs can use other policies to select the
writable branch: round-robin and most-free-space for file creation, and
top-down-parent, bottom-up-parent and bottom-up for copy-up.
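
As a rough illustration of the default top-down-parent policy described
above, here is a minimal userspace sketch in C. It is not aufs code:
struct branch_model and select_top_down_parent() are hypothetical names
made up for this example, and it assumes that index 0 is the highest
branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one branch; not an aufs structure. */
struct branch_model {
    bool writable;   /* the branch is writable */
    bool has_parent; /* the parent (or ancestor) dir of the target exists here */
};

/*
 * Top-down-parent: scan from the highest branch (assumed to be index 0)
 * downwards and return the index of the first writable branch which
 * already has the parent dir. Returns -1 when no branch qualifies.
 */
int select_top_down_parent(const struct branch_model *br, int nbr)
{
    assert(br != NULL || nbr == 0);

    for (int i = 0; i < nbr; i++) {
        if (br[i].writable && br[i].has_parent)
            return i;
    }
    return -1;
}
```

For branches {readonly, writable without the parent dir, writable with
the parent dir}, the function returns index 2, because the first
writable branch is skipped for lacking the parent dir.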
As expected, the round-robin policy selects branches in circular order.
When you have two writable branches and create 10 new files, 5 files
will be created on each branch. The mkdir(2) systemcall is an
exception. When you create 10 new directories, all are created on the
same branch.
The most-free-space policy selects the writable branch which has the
most free space. aufs checks the amount of free space internally, and
users can specify the time interval of this check.

The policies for copy-up are simpler.
- top-down-parent is equivalent to the create policy of the same name.
- bottom-up-parent selects the writable branch where the parent dir
  exists and which is the nearest upper one from the copyup-source.
- bottom-up selects the nearest upper writable branch from the
  copyup-source, regardless of the existence of the parent dir.

There are some rules or exceptions in applying these policies.
- If there is a readonly branch above the policy-selected branch and
  the parent dir is marked as opaque (a variation of whiteout), or the
  target (creating) file is whiteout-ed on the upper readonly branch,
  then the policy will be ignored and the target file will be created
  on the nearest writable branch above that readonly branch.
- If there is a writable branch above the policy-selected branch and
  the parent dir is marked as opaque or the target file is whiteout-ed
  on the branch, then the policy will be ignored and the target file
  will be created on the highest one among the upper writable branches
  which have diropq or whiteout. In case of whiteout, aufs removes it
  as usual.
- link(2) and rename(2) systemcalls are exceptions in every policy.
  They try to select the branch where the source exists whenever
  possible, since copying up a large file takes a long time. If that is
  not possible, i.e. the branch where the source exists is readonly,
  then they follow the copyup policy.
- There is an exception for rename(2) when the target exists.
  If the rename target exists, aufs compares the indices of the
  branches where the source and the target exist and selects the higher
  one. If the selected branch is readonly, then aufs follows the copyup
  policy.

o revert everything after an error on a branch in a single systemcall,
  and remove/rename dir -- temporary name and EXDEV
----------------------------------------------------------------------
Since aufs handles several filesystems internally, it is important to
revert everything after an error happens on a branch internally, and to
return the expected error of the systemcall. To do this, aufs selects
only one target writable branch for create/remove operations and does
not change the other branches. Additionally, aufs has to pay attention
to the order of its internal operations to make them revertible at any
point.
The general rule is as follows.
For creation,
- lock the real dir on the target branch
- lookup a whiteout for the target
- actual creation of the target
- unlink the whiteout for it, if it exists
- d_instantiate()
- unlock the real dir

For removal,
- lock the real dir on the target branch
- create a whiteout for the target, if needed
- actual removal of the target, if it exists on the target branch
- unlock the real dir

Generally rename(2) can handle a destination dir which already exists,
and aufs_rename() basically calls vfs_rename() on the writable branch.
When an empty dst-dir exists on the lower branch(es), aufs has to make
the renamed dir opaque (which is a variation of whiteout and called
'diropq') by creating a special 'diropq' file under the renamed dir.
If aufs cannot create the 'diropq' file, aufs cannot revert the
previous vfs_rename().
To address this problem, aufs renames the existing dst-dir to a new
temporary whiteout-ed name before the actual vfs_rename(). After all
operations have succeeded, aufs_rename() passes the temporary name to
another kernel thread and returns. The kernel thread removes the
temporary name later.
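
The temporary-name approach above can be modelled in userspace as a
small state machine. This is only a sketch under assumptions, not the
aufs implementation: struct names_model and model_rename_dir() are
hypothetical names, the three "names" stand for directory entries on
the writable branch, and the undo order in the failure path is this
sketch's own reading of "revertible at any point".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What a given name on the writable branch currently refers to. */
enum obj_model { OBJ_NONE, OBJ_SRC_DIR, OBJ_OLD_DST_DIR };

/* Hypothetical model of the three names involved in the rename. */
struct names_model {
    enum obj_model src_name; /* old name of the dir being renamed */
    enum obj_model dst_name; /* destination name */
    enum obj_model tmp_name; /* temporary whiteout-ed name */
    bool diropq;             /* renamed dir has been made opaque */
};

/*
 * Returns 0 on success, -1 on failure. diropq_fails injects an error
 * at the 'diropq' creation step so that the undo path can be checked.
 */
int model_rename_dir(struct names_model *n, bool diropq_fails)
{
    assert(n != NULL);

    /* 1. park the existing dst-dir under the temporary name */
    n->tmp_name = n->dst_name;
    n->dst_name = OBJ_NONE;

    /* 2. the actual rename of the src-dir (vfs_rename() in aufs) */
    n->dst_name = n->src_name;
    n->src_name = OBJ_NONE;

    /* 3. make the renamed dir opaque */
    if (diropq_fails) {
        /* undo step 2, then step 1 (assumed order) */
        n->src_name = n->dst_name;
        n->dst_name = n->tmp_name;
        n->tmp_name = OBJ_NONE;
        return -1;
    }
    n->diropq = true;

    /* 4. tmp_name stays in place here; aufs hands it to another
     *    kernel thread which removes it later */
    return 0;
}
```

Because the old dst-dir is only parked and not removed before the
'diropq' step, a failure at that step leaves enough state to put every
name back where it was.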
If aufs cannot create the 'diropq' file, it tries to vfs_rename() the
src-dir back to its old name, and the temporary name back to the old
dst-dir name.

This approach is implemented in aufs_rmdir() too (except when the
branch is NFS), and is very effective when the target dir has many
whiteouts, since aufs has to unlink the child whiteouts before calling
vfs_rmdir(). That may take a long time, and the user has to wait until
the removal of the _logically_ empty dir completes. With this approach,
the user does not need to wait so long. But when the number of child
whiteouts is small, nobody likes this overhead. So aufs has an option
which specifies the threshold of the number of child whiteouts.

In rename(2), when the target dir has its children on several branches,
aufs_rename() returns -EXDEV, since it may cause many/long internal
copy-ups. Generally mv(1) supports this case and retries with
create/copy for each child.

Thank you for reading this long post and my broken English.
Junjiro Okajima
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html