Hello fs-developers,

I am developing a stackable unification filesystem which unifies
several directories and provides a single merged directory. I guess
most people already know what it is.

When users access a file, the access is redirected to the real file on
the member filesystem. The member filesystem is called a 'lower
filesystem' or 'branch' and has a mode, either 'readonly' or
'readwrite'. File deletion is handled as a 'whiteout' on the upper
writable branch.

On this ML, there have been discussions about UnionMount (Jan Blunck
and Bharata B Rao) and Unionfs (Erez Zadok). They took different
approaches to implementing the merged view. The former tries to put it
into the VFS, and the latter implements it as a separate filesystem.
(If I misunderstand these implementations, please let me know and I
will correct it. It has been a long time since I last read their
source files.)

UnionMount's approach can be small, but it may be hard to share
branches between several UnionMounts, since the whiteout is implemented
in the inode on the branch filesystem and is therefore always shared.
According to Bharata's recent post, readdir does not seem to be
finished yet.

Unionfs has a longer history. When I got the idea of a stacking
filesystem (Aug 2005), it already existed. It has virtual super_block,
inode, dentry and file objects, and each of them has an array pointing
to the lower objects of the same kind. After contributing many patches
to Unionfs, I restarted my own project, AUFS (Jun 2006).

In AUFS, the structure of the filesystem is similar to Unionfs, but I
implemented my own ideas, approaches and enhancements in it. Some of
them are described here, and the intention of this post is to get some
initial feedback about the design. You can see the actual details,
documents, CVS logs, and how people are using it at
<http://aufs.sf.net>.

Kindly review and let me know your comments.

o file mapping -- mmap and sharing pages
----------------------------------------------------------------------
In AUFS, the file-mapped pages are shared between the lower file and
the virtual AUFS one by overriding the vm operations (vm_ops),
particularly ->fault().

In aufs_mmap(),
- get and store the vm_ops of the lower file.
- map the aufs file by generic_file_mmap() and set aufs's own vm
  operations.

In aufs_fault(),
- a race can happen here, for instance with a multithreaded library.
- get the aufs file from the passed vma, and sleep if needed.
- get the lower file from the aufs file.
- call ->fault() in the previously stored vm_ops, with vm_file set to
  the lower file.
- restore vm_file, and wake_up if someone else went to sleep.

When a member filesystem is added to or deleted from the stack (often
called a union), a same-named file may be unveiled, and a process which
read(2)s through a previously opened file will see its contents
replaced by the new ones. (Some users may not want the file data to be
refreshed. For such users, I plan to implement a mount option 'refrof'
which decides whether to refresh the opened files or not.)
In this case, an already mapped file will not be updated, since the
contents are a part of a process and should not be changed by AUFS
branch management. Of course, if the branch being deleted has a busy
file, it cannot be deleted from the union.

In UnionMount, this does not matter since it doesn't have its own
inode and file objects.
In Unionfs, the memory pages mapped to file data were copied from the
lower (real) file into Unionfs's virtual one and handled by
address_space operations. Recently Unionfs changed this to the approach
I suggested last December, which AUFS has taken since Jul 2006.

o external inode number table and bitmap (XINO/XIB)
----------------------------------------------------------------------
Because aufs has its own virtual inode, it has to manage the inode
number.
Generally iunique() is used for this purpose, but a problem may arise
when a user executes chmod/chown -R on a large directory, or rmdir on a
directory which has children. chmod/chown -R checks the inode number,
so if the number is silently re-assigned internally, the command
returns an error. In rmdir, dentry_unhash() is called and the child
dentry/inode is unhashed. It means the inode number for the child will
be re-assigned when it is accessed again.

To keep the inode number unchanged, aufs has an external inode number
table and bitmap (called 'xino' and 'xib') per branch filesystem. The
table is a regular file which is created automatically on the first
writable branch by default. When several branches exist on the same
(real) filesystem, those files are shared.
If xino/xib is unnecessary for a user, they can specify the 'noxino'
mount option to disable it. Aufs shows the size of these files via
sysfs.

Currently the xino/xib files are created and then deleted at aufs mount
time (the files remain open), but I have a request from users who run
aufs on an NFS server and export it. So I will implement an option not
to delete the xino/xib files and to re-use them after an NFS server
reboot.

In UnionMount, this does not matter since it doesn't have its own
inode.
In Unionfs, they took the iunique() approach and still have the above
problem. But they have already started the Unionfs-ODF branch, which
uses another mounted filesystem and delegates the inode number
management to it. The ODF approach has some overhead since it requires
creating/removing files/dirs on another filesystem.

o cache coherency or user's direct access to branch filesystems (UDBA) -- inotify
----------------------------------------------------------------------
Users may create/delete/change files on a branch, bypassing aufs, at
any time (user's direct access, UDBA).
Because aufs has its own inode and file objects and they are cached in
a generic way, it has to keep the inode attributes and the directory
listings up to date. In order to implement this, aufs has three levels
of detection test.

The strictest test uses the inotify (CONFIG_INOTIFY) feature. When a
user specifies this test level, aufs sets an inotify watch on all the
branch dirs in cache. When an aufs dir inode object is created and
cached, it refers to the real dirs on the branches; aufs sets inotify
watches on them and is notified when UDBA occurs. The watch is cleared
when the aufs dir inode is purged from the system inode cache.
When UDBA occurs, aufs registers a function with the 'events' thread by
schedule_work(), and the function sets a special status in the private
data of the cached aufs inode. When the same file is accessed through
aufs, aufs detects the status and refreshes all necessary data.

The other two test levels do not use inotify.
The simplest test level checks nothing. It is for readonly filesystems
such as cdrom (even if the strictest test is specified, aufs doesn't
set inotify on such filesystems).
The middle level (default) compares inode attributes in
d_revalidate(). It means this test level may not be effective for a
negative dentry.

In most cases, I guess the default level is enough, and users can
execute 'mount -o remount /aufs' to discard the unused caches. But if a
user really wants UDBA to be reflected soon, the highest test level
will help.

o hardlink over branches, pseudo-link
----------------------------------------------------------------------
When a file on a lower readonly branch is hard-linked (fileA and fileB)
and a user modifies fileA, aufs copies it up to the upper writable
branch and makes the originally requested change to fileA on the upper
branch. On the writable branch, fileA is not hardlinked. It means fileB
on the lower branch still has the old contents.
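
The effect described above can be reproduced in user space without
aufs. The following is only a sketch under assumptions: the 'lower' and
'upper' directories and the function name are hypothetical stand-ins
for a readonly and a writable branch, and shutil.copy2() merely
imitates a copy-up; it is not how aufs itself does it.

```python
import os
import shutil
import tempfile


def copy_up_breaks_hardlink():
    """Imitate copy-up of one name of a hard-linked file and report the result."""
    with tempfile.TemporaryDirectory() as top:
        lower = os.path.join(top, "lower")  # stand-in for the readonly branch
        upper = os.path.join(top, "upper")  # stand-in for the writable branch
        os.mkdir(lower)
        os.mkdir(upper)

        # fileA and fileB are two names for one inode on the lower branch.
        lower_a = os.path.join(lower, "fileA")
        lower_b = os.path.join(lower, "fileB")
        with open(lower_a, "w") as f:
            f.write("old")
        os.link(lower_a, lower_b)
        same_inode_before = os.stat(lower_a).st_ino == os.stat(lower_b).st_ino

        # Imitated copy-up: the change to fileA goes to a fresh copy on upper.
        upper_a = os.path.join(upper, "fileA")
        shutil.copy2(lower_a, upper_a)
        with open(upper_a, "w") as f:
            f.write("new")

        # A merged view would take fileA from upper, while fileB still
        # resolves to the untouched inode on the lower branch.
        with open(upper_a) as f:
            merged_a = f.read()
        with open(lower_b) as f:
            merged_b = f.read()
        upper_nlink = os.stat(upper_a).st_nlink
        return same_inode_before, merged_a, merged_b, upper_nlink


if __name__ == "__main__":
    print(copy_up_breaks_hardlink())
```

The two names share one inode before the copy, but afterwards the copy
on the upper directory has a link count of 1 and the second name still
reads the old data.
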
To address this problem, aufs introduced a 'pseudo-link' (plink), which
is a logical hardlink over branches. It maintains a simple inode list
in memory and checks whether the accessed inode is in the list. As a
result, fileB is handled as if it existed on the writable branch, by
referencing fileA's inode on the writable branch as fileB's inode.
Additionally, to support the case where fileA on the writable branch is
deleted, aufs creates another hardlink on the writable branch, which
exists under a special directory to hide it from users.

At remount/umount time, the /sbin/{mount,umount}.aufs scripts check the
pseudo-linked inode list in aufs, reproduce all real hardlinks on the
writable branch, and flush the list in memory (but these scripts have a
potential race problem).

Thank you for reading this long post and my broken English.

Junjiro Okajima