[RFC 2/2] AUFS: merging/stacking several filesystems

> Here are some of them and the intention of this post is to get some
> initial feedback about its design.
	:::
> Kindly review and let me know your comments.

o readdir -- virtual dir block in memory (VDIR)
----------------------------------------------------------------------
This is the approach I posted a few months ago in reply to UnionMount's
post. It constructs a virtual dir block in memory. For readdir, aufs
calls vfs_readdir() internally for each lower dir, merges their
entries while eliminating the whiteout-ed ones, and attaches the result
to the file (dir) object. So the file object keeps its entry list until
it is closed. The entry list is rebuilt when the file position is zero
and the list has become old. Aufs makes this decision automatically.
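
To illustrate the merge step only, here is a minimal userspace sketch in
Python, not kernel code. It assumes that branches are listed top-first,
that a whiteout for "name" appears as an entry called ".wh.name", and
that a whiteout hides entries only on the branches below it. The
function name merge_dir_entries is hypothetical.

```python
WH_PREFIX = ".wh."  # assumed whiteout name prefix


def merge_dir_entries(branches):
    """Merge per-branch directory listings into one entry list.

    branches: list of lists of names, topmost branch first.
    Returns the visible names in the order they were first seen.
    """
    merged = []
    seen = set()
    whiteouts = set()  # names hidden by whiteouts on upper branches
    for entries in branches:
        new_whiteouts = set()
        for name in entries:
            if name.startswith(WH_PREFIX):
                # A whiteout hides the name on the branches below this one.
                new_whiteouts.add(name[len(WH_PREFIX):])
            elif name not in seen and name not in whiteouts:
                seen.add(name)
                merged.append(name)
        whiteouts |= new_whiteouts
    return merged
```

In this model the merged list would be cached on the open dir object
and rebuilt only when the position is zero and the list is stale.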

It may consume a rather large amount of memory and CPU cycles. To
reduce the number of memory allocations, the implementation became
rather tricky.

Some people may call this a security hole or a DoS vector, since a dir
(file object) that has been opened and readdir-ed once holds its entry
list and puts pressure on system memory. But I'd say it is similar to
files under /proc or /sys. The virtual files on procfs and sysfs also
hold a memory page (generally) while they are open. When an idea to
reduce memory for them is introduced, it will be applied to aufs too.


o policies for selecting one among multiple writable branches,
  parent-dir, round-robin and most-free-space
----------------------------------------------------------------------
When there is more than one writable branch, aufs has to decide the
target branch for file creation or copy-up. By default, the highest
writable branch which has the parent (or ancestor) dir of the target
file is chosen (top-down-parent policy).
At the user's request, aufs can use other policies to select the
writable branch: round-robin and most-free-space for file creation, and
top-down-parent, bottom-up-parent and bottom-up for copy-up.

As expected, the round-robin policy selects branches in circular
order. When you have two writable branches and create 10 new files,
5 files will be created on each branch. The mkdir(2) systemcall is an
exception: when you create 10 new directories, all are created on the
same branch.
The most-free-space policy selects the writable branch which has the
most free space. The amount of free space is checked by aufs
internally, and users can specify the checking interval.
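
The two create policies can be modelled in a few lines of Python. This
is only a sketch under assumptions: the class and function names are
hypothetical, the mkdir(2) exception is not modelled, and the
free-space figures are passed in directly instead of being refreshed
at an interval.

```python
class RoundRobin:
    """Pick writable branches in circular order (file creation only)."""

    def __init__(self, writable_branches):
        self._branches = list(writable_branches)
        self._next = 0

    def select(self):
        branch = self._branches[self._next % len(self._branches)]
        self._next += 1
        return branch


def most_free_space(free_bytes):
    """free_bytes: dict mapping branch name to its cached free space."""
    return max(free_bytes, key=free_bytes.get)
```

With two writable branches, ten calls to select() return each branch
five times, which matches the 5/5 split described above.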

The policies for copy-up are simpler.
- top-down-parent is equivalent to the create policy of the same name.
- bottom-up-parent selects the nearest writable branch above the
  copyup-source where the parent dir exists.
- bottom-up selects the nearest writable branch above the
  copyup-source, regardless of the existence of the parent dir.
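
As a rough illustration, the three copy-up policies can be written as
small Python functions. The model is an assumption: branches are a
list with index 0 as the topmost branch, each entry records whether
the branch is writable and whether the parent dir exists there, and
only branches above the copyup-source are considered. The special
rules about opaque dirs and whiteouts are not modelled.

```python
def _candidate(branch, need_parent):
    return branch["writable"] and (branch["has_parent"] or not need_parent)


def top_down_parent(branches, src_idx):
    # Highest writable branch (above the source) that has the parent dir.
    for i in range(src_idx):
        if _candidate(branches[i], need_parent=True):
            return i
    return None


def bottom_up_parent(branches, src_idx):
    # Nearest writable branch above the source that has the parent dir.
    for i in range(src_idx - 1, -1, -1):
        if _candidate(branches[i], need_parent=True):
            return i
    return None


def bottom_up(branches, src_idx):
    # Nearest writable branch above the source, parent dir or not.
    for i in range(src_idx - 1, -1, -1):
        if _candidate(branches[i], need_parent=False):
            return i
    return None
```

Each function returns a branch index, or None when no branch
qualifies in this model.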

There are some rules and exceptions in applying these policies.
- If there is a readonly branch above the policy-selected branch and
  the parent dir is marked as opaque (a variation of whiteout), or the
  target (to be created) file is whiteout-ed on that upper readonly
  branch, then the policy is ignored and the target file is created
  on the nearest writable branch above the readonly branch.
- If there is a writable branch above the policy-selected branch and
  the parent dir is marked as opaque or the target file is whiteout-ed
  on that branch, then the policy is ignored and the target file is
  created on the highest of the upper writable branches which has the
  diropq or whiteout. In case of whiteout, aufs removes it as usual.
- link(2) and rename(2) systemcalls are exceptions in every policy.
  They try to select the branch where the source exists whenever
  possible, since copying up a large file takes a long time. If that
  is not possible, i.e. the branch where the source exists is
  readonly, then they follow the copyup policy.
- There is an exception for rename(2) when the target exists.
  If the rename target exists, aufs compares the indices of the
  branches where the source and the target exist and selects the
  higher one. If the selected branch is readonly, then aufs follows
  the copyup policy.
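
The last two rules, for link(2)/rename(2), can be sketched as follows.
This is a simplified Python model with assumptions stated up front:
index 0 is the topmost branch, so "higher" means the smaller index;
select_rename_branch and copyup_policy are hypothetical names; and
the index handed to the copy-up policy is a guess, since the text
does not say which one aufs uses.

```python
def select_rename_branch(src_idx, dst_idx, writable, copyup_policy):
    """Choose the branch on which link/rename should operate.

    src_idx: branch index where the source exists.
    dst_idx: branch index where the target exists, or None.
    writable: list of bool, one per branch, index 0 is the top.
    copyup_policy: callable(index) -> branch index (assumption).
    """
    if dst_idx is None:
        idx = src_idx  # prefer the branch holding the source
    else:
        idx = min(src_idx, dst_idx)  # the higher of the two branches
    if writable[idx]:
        return idx
    # The preferred branch is readonly: fall back to the copy-up policy.
    return copyup_policy(idx)
```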


o revert everything after an error on a branch in a single systemcall,
  and remove/rename dir -- temporary name and EXDEV
----------------------------------------------------------------------
Since aufs handles several filesystems internally, it is important to
revert everything after an error happens on a branch internally, and to
return the expected error for the systemcall.
To do this, aufs selects only one target writable branch for
create/remove operations and doesn't change the other
branches. Additionally, aufs has to pay attention to the order of its
internal operations to make them revertible at any point. The general
rules are as follows.

For creation,
- lock the real dir on the target branch
- lookup a whiteout for the target
- actual creation of the target
- unlink the whiteout for it, if exists
- d_instantiate()
- unlock the real dir

For removal,
- lock the real dir on the target branch
- create a whiteout for the target, if needed
- actual removal of the target, if it exists on the target branch
- unlock the real dir
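
The two orderings above can be imitated in userspace to show why they
stay revertible. The Python sketch below is not aufs code. It assumes
a single branch directory, a ".wh." whiteout prefix, and a plain lock
standing in for the dir lock; d_instantiate() has no userspace
counterpart and is left out. The function names are hypothetical.

```python
import os
import threading

WH_PREFIX = ".wh."  # assumed whiteout name prefix
_dir_lock = threading.Lock()  # stands in for locking the real dir


def create_on_branch(branch_dir, name):
    target = os.path.join(branch_dir, name)
    whiteout = os.path.join(branch_dir, WH_PREFIX + name)
    with _dir_lock:
        has_whiteout = os.path.lexists(whiteout)  # lookup the whiteout
        fd = os.open(target, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
        os.close(fd)  # actual creation is done
        if has_whiteout:
            try:
                os.unlink(whiteout)
            except OSError:
                os.unlink(target)  # revert the creation
                raise


def remove_from_branch(branch_dir, name, exists_on_lower):
    target = os.path.join(branch_dir, name)
    whiteout = os.path.join(branch_dir, WH_PREFIX + name)
    with _dir_lock:
        created_whiteout = False
        if exists_on_lower and not os.path.lexists(whiteout):
            open(whiteout, "x").close()  # create the whiteout first
            created_whiteout = True
        if os.path.lexists(target):
            try:
                os.unlink(target)
            except OSError:
                if created_whiteout:
                    os.unlink(whiteout)  # revert the whiteout
                raise
```

In both functions the step that can still fail comes after a step
that is cheap to undo, so an error leaves the branch as it was.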

Generally rename(2) can handle the destination dir which already
exists, and aufs_rename() basically calls vfs_rename() on the writable
branch. When an empty dst-dir exists on the lower branch(es), aufs has
to make the renamed dir opaque (which is a variation of whiteout and
called 'diropq') by creating a special 'diropq' file under the renamed
dir.
If aufs cannot create the 'diropq' file, aufs cannot revert the
previous vfs_rename().

To address this problem, aufs renames the existing dst-dir to a new
temporary whiteout-ed name before the actual vfs_rename(). After
all operations have succeeded, aufs_rename() passes the temporary name
to another kernel thread and returns.
The kernel thread removes the temporary name later.
If aufs cannot create the 'diropq' file, it tries to vfs_rename() the
src-dir back to its old name, and the temporary name back to the old
dst-dir name.
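
The sequence can be tried out in userspace with ordinary directories.
The following Python sketch is only a model: it works inside a single
branch directory, the temporary name ".wh.<dst>.tmp" is made up for
the example, the diropq file name ".wh..wh..opq" is an assumption,
and a queue drained by drain_deferred() stands in for the other
kernel thread.

```python
import os
import queue
import shutil

DIROPQ_NAME = ".wh..wh..opq"  # assumed name of the 'diropq' file
deferred = queue.Queue()  # temporary names waiting for removal


def make_diropq(dir_path):
    open(os.path.join(dir_path, DIROPQ_NAME), "x").close()


def rename_dir_over(branch, src, dst, mark_opaque=make_diropq):
    src_p = os.path.join(branch, src)
    dst_p = os.path.join(branch, dst)
    tmp_p = os.path.join(branch, ".wh." + dst + ".tmp")  # hypothetical
    os.rename(dst_p, tmp_p)  # move the old dst-dir out of the way
    try:
        os.rename(src_p, dst_p)  # the actual rename
    except OSError:
        os.rename(tmp_p, dst_p)
        raise
    try:
        mark_opaque(dst_p)
    except OSError:
        os.rename(dst_p, src_p)  # src-dir back to its old name
        os.rename(tmp_p, dst_p)  # temporary name back to dst-dir
        raise
    deferred.put(tmp_p)  # removal happens later, off the caller's path


def drain_deferred():
    while not deferred.empty():
        shutil.rmtree(deferred.get())
```

Because the old dst-dir still exists under the temporary name, a
failure while creating the diropq file can be undone with two renames.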

This approach is implemented in aufs_rmdir() too (except when the
branch is NFS), and is very effective when the target dir has many
whiteouts, since aufs has to unlink the child whiteouts before calling
vfs_rmdir().
That may take a long time, and the user has to wait until the
_logically_ empty dir is removed.
With this approach, the user doesn't need to wait so long.
But when the number of child whiteouts is small, nobody wants this
overhead. So aufs has an option which specifies the threshold for the
number of child whiteouts.
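
The threshold decision might look like the following in a userspace
model. This is a sketch with assumptions: whiteouts are recognised by
a ".wh." name prefix, the comparison is "at least the threshold"
(the text does not say whether the bound is inclusive), and the
function names are hypothetical.

```python
import os

WH_PREFIX = ".wh."  # assumed whiteout name prefix


def count_child_whiteouts(dir_path):
    return sum(1 for n in os.listdir(dir_path) if n.startswith(WH_PREFIX))


def should_defer_rmdir(dir_path, threshold, branch_is_nfs=False):
    """True if the dir should be renamed away and removed later."""
    if branch_is_nfs:
        return False  # the temporary-name approach is not used on NFS
    return count_child_whiteouts(dir_path) >= threshold
```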

In rename(2), when the target dir has children on several branches,
aufs_rename() returns -EXDEV, since the rename may cause many or long
internal copy-ups. Generally mv(1) supports this case and retries with
create/copy for each child.
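
The fallback that a userspace tool performs on EXDEV can be sketched
in Python as below. It is an illustration of the general pattern and
not mv(1)'s actual implementation. The function name
move_with_fallback and its rename parameter are hypothetical; the
parameter exists only so that the EXDEV path can be exercised.

```python
import errno
import os
import shutil


def move_with_fallback(src, dst, rename=os.rename):
    """Rename a dir; on EXDEV, copy every child and remove the source."""
    try:
        rename(src, dst)
    except OSError as e:
        if e.errno != errno.EXDEV:
            raise
        # The rename was refused: create/copy each child instead.
        shutil.copytree(src, dst, symlinks=True)
        shutil.rmtree(src)
```

Python's own shutil.move() follows a similar copy-then-remove
strategy when a plain rename is not possible.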


Thank you for reading this long post and my broken English.

Junjiro Okajima
