[RFC 1/2] AUFS: merging/stacking several filesystems

hooanon05@xxxxxxxxxxx · Wed, 02 Apr 2008 14:12:15 +0900

Hello fs-developers,

I am developing a stackable unification filesystem which unifies several
directories and provides a single merged directory.
I guess most people already know what it is. When a user accesses a
file, the access is redirected to the real file on the member
filesystem. A member filesystem is called a 'lower filesystem' or
'branch' and has a mode, either 'readonly' or 'readwrite'. File
deletion is handled as a 'whiteout' on the upper writable branch.
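
To illustrate the lookup and whiteout behaviour described above, here
is a minimal user-space sketch in Python. It is only a model, not aufs
code: the dict-based branches, the function names and the '.wh.' marker
prefix are all assumptions made for this example.

```python
WHITEOUT_PREFIX = ".wh."  # assumed marker name, for this sketch only


def lookup(branches, name):
    """Return (branch_index, content) for name in the merged view, or None.

    branches is ordered from the uppermost branch to the lowest one. Each
    branch is a dict with 'mode' ('rw' or 'ro') and 'files' (name -> content).
    """
    for index, branch in enumerate(branches):
        files = branch["files"]
        if WHITEOUT_PREFIX + name in files:
            return None  # a whiteout hides the name on all lower branches
        if name in files:
            return index, files[name]
    return None


def unlink(branches, name):
    """Delete name from the merged view without touching readonly branches."""
    if lookup(branches, name) is None:
        raise FileNotFoundError(name)
    top = branches[0]
    if top["mode"] != "rw":
        raise PermissionError("no writable upper branch")
    top["files"].pop(name, None)
    # A whiteout is needed only if a lower branch still has the name.
    if any(name in b["files"] for b in branches[1:]):
        top["files"][WHITEOUT_PREFIX + name] = ""


def example():
    branches = [
        {"mode": "rw", "files": {}},
        {"mode": "ro", "files": {"a.txt": "lower"}},
    ]
    before = lookup(branches, "a.txt")
    unlink(branches, "a.txt")
    after = lookup(branches, "a.txt")
    return before, after, branches
```

In example(), the file on the readonly branch is left untouched; only a
marker entry is added on the writable branch, and the merged view no
longer shows the name.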

On this ML, there have been discussions about UnionMount (Jan Blunck
and Bharata B Rao) and Unionfs (Erez Zadok). They took different
approaches to implementing the merged view.
The former tries to put it into VFS, and the latter implements it as a
separate filesystem.
(If I misunderstand these implementations, please let me know and
I shall correct it, because it has been a long time since I last read
their source files.)
UnionMount's approach can be small, but it may be hard to share
branches between several UnionMounts, since the whiteout is
implemented in the inode on the branch filesystem and is always
shared. According to Bharata's recent post, readdir does not seem to
be finished yet.
Unionfs has a longer history. When I got the idea of a stacking
filesystem (Aug 2005), it already existed. It has virtual super_block,
inode, dentry and file objects, and each of them has an array pointing
to the lower objects of the same kind. After contributing many patches
to Unionfs, I re-started my project AUFS (Jun 2006).

In AUFS, the structure of the filesystem is similar to Unionfs, but I
implemented my own ideas, approaches and enhancements in it.
Some of them are described below; the intention of this post is to get
some initial feedback about the design.
You can see the actual details, documents, CVS logs, and how people
are using it at
<http://aufs.sf.net>.

Kindly review and let me know your comments.

o file mapping -- mmap and sharing pages
----------------------------------------------------------------------
In AUFS, the file-mapped pages are shared between the lower file and
AUFS's virtual one by overriding the vm operations (vm_ops),
particularly ->fault().

In aufs_mmap(),
- get and store the vm_ops of the lower file.
- map the aufs file by generic_file_mmap() and set aufs's vm operations.

In aufs_fault(),
- a race can happen here, for instance in a multithreaded library, since
  vm_file is temporarily replaced in the steps below.
- get the aufs file from the passed vma, sleep if needed.
- get the lower file from the aufs file.
- call ->fault() in the previously stored vm_ops, with vm_file set to
  the lower file.
- restore vm_file and wake_up if someone else went to sleep.
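
The aufs_fault() steps above can be modelled in user space as follows.
This is a rough Python sketch using threads, not kernel code: the Vma,
AufsFile and LowerFile classes, the 'swapped' flag and the condition
variable are assumptions of this example, standing in for the sleep and
wake_up steps.

```python
import threading


class LowerFile:
    def __init__(self, data):
        self.data = data


def lower_fault(vma, offset):
    # The lower handler expects vma.vm_file to be its own (lower) file.
    return vma.vm_file.data[offset]


class AufsFile:
    def __init__(self, lower_file, lower_fault_op):
        self.lower_file = lower_file
        self.lower_fault_op = lower_fault_op  # stored at mmap time


class Vma:
    def __init__(self, aufs_file):
        self.vm_file = aufs_file
        self.cond = threading.Condition()  # hypothetical wait queue
        self.swapped = False               # True while vm_file is the lower file


def aufs_fault(vma, offset):
    with vma.cond:
        while vma.swapped:           # another thread is inside the lower fault
            vma.cond.wait()          # "sleep if needed"
        aufs_file = vma.vm_file      # get the aufs file from the vma
        vma.vm_file = aufs_file.lower_file
        vma.swapped = True
    try:
        return aufs_file.lower_fault_op(vma, offset)
    finally:
        with vma.cond:
            vma.vm_file = aufs_file  # restore vm_file
            vma.swapped = False
            vma.cond.notify_all()    # wake up the sleepers


def example():
    aufs_file = AufsFile(LowerFile(b"hello"), lower_fault)
    vma = Vma(aufs_file)
    results = [None] * 5

    def worker(i):
        results[i] = aufs_fault(vma, i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(results), vma.vm_file is aufs_file
```

The point of the sketch is only the ordering: a second thread never
sees vm_file pointing to the lower file, and vm_file is always restored
before anyone else proceeds.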

When a member filesystem is added to or deleted from the stack (often
called a union), a file with the same name may become visible, and its
contents will replace the old ones when a process calls read(2) through
a previously opened file.
(Some users may not want the file data to be refreshed. For such users,
I plan to implement a mount option 'refrof' which decides whether to
refresh the opened files or not.)
In this case, an already mapped file will not be updated, since the
contents are a part of a process and should not be changed by AUFS
branch management. Of course, if the branch being deleted has a
busy file, it cannot be deleted from the union.

In UnionMount, this does not matter since it doesn't have its own inode
and file objects.
In Unionfs, the memory pages mapped to file data were copied from
the lower (real) file into Unionfs's virtual one and handled by
address_space operations. Recently Unionfs changed to the approach I
suggested last December, which AUFS has used since Jul 2006.

o external inode number table and bitmap (XINO/XIB)
----------------------------------------------------------------------
Because aufs has its own virtual inode, it has to manage the inode
number. Generally iunique() is used for this purpose, but when a user
executes chmod/chown -R on a large directory, or rmdir on a directory
which has children, a problem may arise. chmod/chown -R checks the
inode number, but the number may be changed/re-assigned silently and
internally, and then the command returns an error. In rmdir,
dentry_unhash() is called and the child dentry/inode is unhashed. It
means the inode number for the child will be changed/re-assigned when
it is accessed again.

To keep the inode number unchanged, aufs has an external inode number
table and bitmap (called 'xino' and 'xib') per branch
filesystem. The table is a regular file which is created automatically
on the first writable branch by default. When several branches exist
on the same (real) filesystem, those files will be shared.
If xino/xib is unnecessary for a user, he/she can specify the 'noxino'
mount option to disable it.
Aufs shows the size of these files via sysfs.
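
As a rough illustration of the idea, here is a small Python model of
such a table and bitmap. The layout is an assumption of this sketch
(one fixed-size record per lower inode number holding the aufs inode
number, 0 meaning "not assigned", and one bit per aufs inode number in
use); an in-memory buffer stands in for the regular file.

```python
import io
import struct

RECORD = struct.Struct("<Q")  # assumed record: one 64-bit aufs inode number


class Xino:
    def __init__(self):
        self.table = io.BytesIO()  # stands in for the xino file
        self.bitmap = bytearray()  # stands in for the xib file

    def _test(self, ino):
        byte, bit = divmod(ino, 8)
        return byte < len(self.bitmap) and bool(self.bitmap[byte] & (1 << bit))

    def _set(self, ino, used):
        byte, bit = divmod(ino, 8)
        if byte >= len(self.bitmap):
            self.bitmap.extend(bytes(byte + 1 - len(self.bitmap)))
        if used:
            self.bitmap[byte] |= 1 << bit
        else:
            self.bitmap[byte] &= ~(1 << bit) & 0xFF

    def _read(self, lower_ino):
        self.table.seek(lower_ino * RECORD.size)
        raw = self.table.read(RECORD.size)
        return RECORD.unpack(raw)[0] if len(raw) == RECORD.size else 0

    def _write(self, lower_ino, ino):
        self.table.seek(lower_ino * RECORD.size)
        self.table.write(RECORD.pack(ino))  # the gap is zero-filled

    def get_ino(self, lower_ino):
        """Return the stable aufs inode number for a lower inode number."""
        ino = self._read(lower_ino)
        if ino == 0:
            ino = 1  # 0 is reserved to mean "not assigned"
            while self._test(ino):
                ino += 1
            self._set(ino, True)
            self._write(lower_ino, ino)
        return ino

    def forget(self, lower_ino):
        """Release the number, e.g. when the lower inode is removed."""
        ino = self._read(lower_ino)
        if ino:
            self._write(lower_ino, 0)
            self._set(ino, False)


def example():
    xino = Xino()
    first = xino.get_ino(1234)
    other = xino.get_ino(77)
    again = xino.get_ino(1234)  # same number even if the cache was dropped
    xino.forget(77)
    reused = xino.get_ino(500)  # the freed number can be handed out again
    return first, other, again, reused
```

Unlike a counter kept only in memory, the number for lower inode 1234
stays the same no matter how often the cached object is dropped and
looked up again.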

Currently xino/xib are created and deleted at aufs mount time (the
files are kept open), but I have a request from users who
export aufs from an NFS server. So I will implement an
option not to delete the xino/xib files and to re-use them after the
NFS server reboots.

In UnionMount, this does not matter since it doesn't have its own inode.
Unionfs took the iunique() approach and still has the above
problem. But they have already started the Unionfs-ODF branch, which
has another mounted filesystem and delegates the inode number
management to it. The ODF approach has some overhead since it requires
creating/removing files/dirs on another filesystem.

o cache coherency or user's direct access to branch filesystems
  (UDBA) -- inotify
----------------------------------------------------------------------
Users may create/delete/change files on a branch, bypassing aufs, at
any time (user's direct access to branches, UDBA). Because aufs has its
own inode and file objects, and they are cached in a generic way, it has
to keep the inode attributes and the directory listing up to date.

In order to implement this, aufs has three levels of detection test. The
strictest test uses the inotify (CONFIG_INOTIFY) feature. When a user
specifies this test level, aufs sets an inotify watch on all the
branch dirs in cache. When an aufs dir inode object is created and
cached, it refers to the real dirs on the branches; aufs sets an
inotify watch on them and is notified when UDBA occurs. The watch
is cleared when the aufs dir inode is purged from the system
inode cache.
When UDBA occurs, aufs hands a function to the 'events' thread by
schedule_work(), and the function sets a special status in the
private data of the cached aufs inode. When the same file is accessed
through aufs, aufs detects the status and refreshes all necessary data.

The other two test levels don't use inotify. The simplest test
level checks nothing. It is for readonly filesystems such as
cdrom (even if the strictest test is specified, aufs doesn't set
inotify on such filesystems). The middle level (default)
checks/compares inode attributes in d_revalidate(). It means this
test level may not be effective for a negative dentry.
In most cases, I guess the default level is enough, and users can execute
'mount -o remount /aufs' to discard the unused caches. But if a user
really wants UDBA to be reflected soon, the strictest test level will
help him/her.
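
The middle (default) level can be pictured with the following
user-space Python sketch. It is not the aufs d_revalidate() code: the
CachedEntry class and the choice of compared attributes (inode number,
size, mtime, mode) are assumptions of this example.

```python
import os
import tempfile


def snapshot(path):
    """Collect the attributes that this sketch compares."""
    st = os.stat(path)
    return (st.st_ino, st.st_size, st.st_mtime_ns, st.st_mode)


class CachedEntry:
    """Stands in for a cached aufs dentry/inode of one file on a branch."""

    def __init__(self, path):
        self.path = path
        self.attrs = snapshot(path)
        self.refreshed = 0

    def revalidate(self):
        """Return False if the entry is stale and must be dropped."""
        try:
            current = snapshot(self.path)
        except FileNotFoundError:
            return False  # the file was removed behind our back
        if current != self.attrs:
            self.attrs = current  # refresh the cached attributes
            self.refreshed += 1
        return True


def example():
    with tempfile.TemporaryDirectory() as branch:
        path = os.path.join(branch, "file")
        with open(path, "w") as f:
            f.write("old")
        entry = CachedEntry(path)
        entry.revalidate()
        unchanged = entry.refreshed
        with open(path, "a") as f:  # direct access, bypassing the cache
            f.write(" and new")
        entry.revalidate()
        changed = entry.refreshed
        os.unlink(path)
        alive = entry.revalidate()
        return unchanged, changed, alive
```

The sketch only covers an entry which already has a file behind it. A
cached "no such file" result has no attributes to compare, which is
the negative dentry limitation mentioned above.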

o hardlink over branches, pseudo-link
----------------------------------------------------------------------
When a file on a lower readonly branch is hard-linked (fileA and
fileB) and a user modifies fileA, aufs copies it up to the upper
writable branch and makes the originally requested change to fileA on
the upper branch. On the writable branch, fileA is not hardlinked. It
means fileB on the lower branch still has the old contents.

To address this problem, aufs introduced a 'pseudo-link' (plink), which
is a logical hardlink over branches. It maintains a simple inode list
in memory and checks whether the accessed inode is in the list.
As a result, fileB is handled as if it existed on the writable branch,
by referencing fileA's inode on the writable branch as fileB's inode.
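
A small Python model may make the pseudo-link lookup clearer. It is a
sketch only: the Inode and Union classes and the dict used for the
plink list are assumptions of this example, and the hidden hardlink and
the remount handling are left out.

```python
class Inode:
    def __init__(self, data):
        self.data = data


class Union:
    def __init__(self, lower):
        self.lower = lower  # name -> Inode on the readonly branch
        self.upper = {}     # name -> Inode on the writable branch
        self.plink = {}     # lower Inode -> copied-up Inode (the plink list)

    def lookup(self, name):
        if name in self.upper:
            return self.upper[name]
        inode = self.lower[name]
        # If another name of this inode was copied up, use that inode.
        return self.plink.get(inode, inode)

    def write(self, name, data):
        if name not in self.upper:
            lower_inode = self.lower[name]
            upper_inode = self.plink.get(lower_inode)
            if upper_inode is None:
                upper_inode = Inode(lower_inode.data)  # copy-up
                self.plink[lower_inode] = upper_inode
            self.upper[name] = upper_inode
        self.upper[name].data = data


def example():
    shared = Inode("old")  # fileA and fileB are hardlinks on the lower branch
    union = Union({"fileA": shared, "fileB": shared})
    union.write("fileA", "new")
    return union.lookup("fileB").data, shared.data
```

After the write to fileA, a lookup of fileB returns the copied-up inode
with the new contents, while the inode on the readonly branch keeps the
old ones.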

Additionally, to support the case where fileA on the writable branch is
deleted, aufs creates another hardlink on the writable branch, which
is placed under a special directory to hide it from users.

At remount/umount time, the /sbin/{mount,umount}.aufs scripts check the
pseudo-linked inode list in aufs, re-produce all real hardlinks on
the writable branch, and flush the list in memory (but these scripts
have a potential race problem).

Thank you for reading this long post and my broken English.

Junjiro Okajima