Re: UnionMount status?

Michal Suchanek <hramrach@xxxxxxxxxx> · Thu, 1 Apr 2010 17:36:20 +0200

On 24 March 2010 00:02, Valerie Aurora <vaurora@xxxxxxxxxx> wrote:
> On Fri, Mar 19, 2010 at 10:47:15PM +0100, Michal Suchanek wrote:
>> Hello
>>
>> On 19 March 2010 19:03, Valerie Aurora <vaurora@xxxxxxxxxx> wrote:
>>
>> >
>> > Where union mounts is right now is in need of more review from VFS
>> > experts (and thanks to those who have already reviewed it).  I'm
>>
>> I don't count myself among VFS experts so I'm sorry if I am restating
>> or missing something obvious.
>
> Thanks for taking a look!

Thanks for taking the time to reply.

Apparently I missed a few properties of the current state of union
mounts, especially that directories are only ever stored in the top
layer and that the bottom directories are accessed only once, when the
merged directory is created in the top layer.

That greatly simplifies lookup operations and avoids some problems with
getting stuck in the bottom layer when something else is already
visible on the top layer.

>
>> > rewriting the in-file copyup code right now, which is dependent on a
>> > lot of ongoing VFS work by Al Viro, Nick Piggin, Dmitriy Monakhov, and
>> > others.  Here's my description of the problem I'm currently working,
>> > which is where I could use review the most:
>> >
>> > http://groups.google.com/group/linux.kernel/msg/217ca5aedbd7bfd0
>> >
>>
>> On Mar 16, 7:20 pm, Valerie Aurora <vaur...@xxxxxxxxxx> wrote:
>> > This patch shows the basic data flow necessary to implement efficient
>> > union mount file copyup (without the actual file copyup code).  I need
>> > input from other VFS people on design, especially since namei.c is
>> > getting some much needed reorganization.  Some background:
>> >
>> > In union mounts, a file on the lower, read-only file system layer is
>> > copied up to the top layer when an application attempts to write to
>> > it.  This is in-kernel file copyup.  Some constraints make this
>> > difficult:
>> >
>> > 1. Don't copyup if the operation would fail (e.g., open(O_WRONLY) on a
>> > file with mode 444).  It's inefficient and a possible security hole to
>> > copy up a file if no write (or maybe even read) can occur anyway.
>>
>> The open fails in that case anyway so I see no reason to copy
>> anything. Why would you copy before you open?
>
> Basically, it's easiest to copy up the file at pathname lookup time,
> before you do all the permission checks.  This constraint really says
> that you have to wait to do the copy up until you have finished any
> check that might fail.

Yes, returning a file from the bottom layer at all complicates things.
However, once you allow that, you apparently have to handle the case
where the bottom file is returned and a write to it is required later.
So it should be possible to return the read-only file and make it
read-write if and when the open for writing succeeds.
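
To make the ordering concrete, here is a rough user-space sketch in
Python (not kernel code). Two plain directories stand in for the
layers, the names union_open and demo are made up for illustration,
and the only check modelled is the owner write bit. The point is just
that every check that can fail runs before any data is copied.

```python
import os
import shutil
import stat
import tempfile


def union_open(lower, upper, name, want_write):
    """Open `name` from a two-layer union; copy up only after checks pass."""
    top = os.path.join(upper, name)
    bottom = os.path.join(lower, name)
    if os.path.exists(top):
        # Already present on top: the top layer wins.
        return open(top, "r+b" if want_write else "rb")
    if not want_write:
        # Read-only open: hand back the bottom file, no copy needed.
        return open(bottom, "rb")
    # Assumption: the owner write bit is the only check that can fail.
    mode = os.stat(bottom).st_mode
    if not mode & stat.S_IWUSR:
        raise PermissionError(f"{name}: not writable, nothing copied up")
    shutil.copy2(bottom, top)  # copy up only once the open is known to succeed
    return open(top, "r+b")


def demo():
    result = {}
    with tempfile.TemporaryDirectory() as lower:
        with tempfile.TemporaryDirectory() as upper:
            for name, mode in (("ro.txt", 0o444), ("rw.txt", 0o644)):
                path = os.path.join(lower, name)
                with open(path, "wb") as f:
                    f.write(b"bottom data")
                os.chmod(path, mode)
            try:
                union_open(lower, upper, "ro.txt", want_write=True).close()
                result["denied"] = False
            except PermissionError:
                result["denied"] = True
            result["ro_copied"] = os.path.exists(os.path.join(upper, "ro.txt"))
            with union_open(lower, upper, "rw.txt", want_write=False) as f:
                f.read()
            result["copied_after_read"] = os.path.exists(
                os.path.join(upper, "rw.txt"))
            with union_open(lower, upper, "rw.txt", want_write=True) as f:
                f.write(b"TOP")
            with open(os.path.join(upper, "rw.txt"), "rb") as f:
                result["top_data"] = f.read()
    return result
```

The failed open of the mode 444 file leaves the top layer untouched,
and the read-only open of the writable file does not trigger a copy
either; only the successful open for writing does.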

>
>> On the other hand, when the open succeeds there is nothing stopping
>> the writes from happening save things like hardware failure or lack of
>> disk space. It's appropriate to create an empty inode in this case.
>> Did you consider creating the files as sparse and handling holes by
>> looking into the lower layer before /dev/zero? But then you would
>> perhaps need a flag that differentiates them from real sparse files.
>
> No, I haven't considered that.  Currently, the design is to copy up
> the file at open() time to simplify the code.

A sparse file would make the open() and especially the lookup code much
simpler, at the cost of adding features elsewhere. Still, it splits the
complexity into multiple parts, which should be easier to manage and
test separately.

It also greatly optimizes the case where the top filesystem is tmpfs,
since the unchanged blocks do not have to be stored at all, as well as
the case where the modified file is removed later.

Still, for a disk-based top filesystem the full copy-up should probably
happen at unmount time, because I do not see how the sparse file could
be bound to the bottom filesystem so that the union can be reliably
reconstructed later. It could be made into a special sparse file that
binds to any underlying bottom file, but I am not sure this is
desirable. The sparse file would also have to record the inode it binds
to, because otherwise renames could be quite expensive.
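
A toy model of what I mean by the sparse file, again in Python and
purely illustrative: SparseTopFile and BLOCK are hypothetical names,
the bottom file is just a bytes object, and membership in the blocks
dictionary plays the role of the flag that tells a "hole backed by the
lower layer" apart from data written on top. Binding to a bottom inode
and the copy-up at unmount are not modelled.

```python
BLOCK = 4  # tiny block size so the behaviour is easy to follow


class SparseTopFile:
    """Top-layer file whose unwritten blocks fall through to the bottom file."""

    def __init__(self, lower: bytes):
        self.lower = lower      # read-only bottom file contents
        self.blocks = {}        # block index -> bytes stored on the top layer
        self.size = len(lower)

    def _block(self, idx):
        if idx in self.blocks:
            return self.blocks[idx]
        # Hole on top: read the bottom file, zero-filled past its end.
        chunk = self.lower[idx * BLOCK:(idx + 1) * BLOCK]
        return chunk.ljust(BLOCK, b"\0")

    def write(self, offset, data):
        pos = offset
        for byte in data:
            idx, off = divmod(pos, BLOCK)
            # Partial writes pull the rest of the block from below first.
            block = bytearray(self._block(idx))
            block[off] = byte
            self.blocks[idx] = bytes(block)
            pos += 1
        self.size = max(self.size, pos)

    def read(self, offset, length):
        end = min(offset + length, self.size)
        out = bytearray()
        for pos in range(offset, end):
            idx, off = divmod(pos, BLOCK)
            out.append(self._block(idx)[off])
        return bytes(out)
```

Writing two bytes into the middle of a 12-byte bottom file stores a
single block on top; everything else is still served from below, which
is where the saving for a tmpfs top layer would come from.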

>
>> Actually the file has to be copied even when it is open for reading
>> because if somebody writes it later the readonly bottom handle would
>> never receive the top updates.
>
> Copying up on every single open costs too much.  Copy up on
> open-for-write does have this odd effect, but I consider it the moral
> equivalent of a process updating a file by copying it to a temporary
> file and then renaming it over the original.

Which never actually happened in this case, so it might break
something that tries to synchronize on files.

I am not sure whether there is any file-based locking or
synchronization that is supposed to work on a plain filesystem but
would not work in this case.

Perhaps an application that pre-creates a file and then mmaps it into
multiple processes (some read-only) would run into this issue.
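
The effect I am worried about can be imitated in user space with two
directories standing in for the layers. This is only a sketch under
that assumption (stale_reader_demo is a made-up name, and plain file
handles are used instead of mmap): a reader that opened the bottom
file before the copy-up never sees what the writer puts in the top
copy.

```python
import os
import shutil
import tempfile


def stale_reader_demo():
    with tempfile.TemporaryDirectory() as lower:
        with tempfile.TemporaryDirectory() as upper:
            bottom = os.path.join(lower, "shared.dat")
            top = os.path.join(upper, "shared.dat")
            with open(bottom, "wb") as f:
                f.write(b"old")
            # Reader opens read-only first, so it gets the bottom file.
            reader = open(bottom, "rb")
            # A later open-for-write triggers the copy-up to the top layer.
            shutil.copyfile(bottom, top)
            with open(top, "r+b") as writer:
                writer.write(b"new")
            # The old handle still refers to the bottom file.
            stale = reader.read()
            reader.close()
            # A fresh open through the union would see the top copy.
            with open(top, "rb") as f:
                fresh = f.read()
    return stale, fresh
```

The reader ends up with the old contents while any new open sees the
new ones, and two processes sharing the file would disagree in the same
way.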

Thanks

Michal