On Fri, Mar 19, 2010 at 10:47:15PM +0100, Michal Suchanek wrote: > Hello > > On 19 March 2010 19:03, Valerie Aurora <vaurora@xxxxxxxxxx> wrote: > > > > > Where union mounts is right now is in need of more review from VFS > > experts (and thanks to those who have already reviewed it). ??I'm > > I don't count myself among VFS experts so I'm sorry if I am restating > or missing something obvious. Thanks for taking a look! > > rewriting the in-file copyup code right now, which is dependent on a > > lot of ongoing VFS work by Al Viro, Nick Piggin, Dmitriy Monakhov, and > > others. ??Here's my description of the problem I'm currently working, > > which is where I could use review the most: > > > > http://groups.google.com/group/linux.kernel/msg/217ca5aedbd7bfd0 > > > > On Mar 16, 7:20 pm, Valerie Aurora <vaur...@xxxxxxxxxx> wrote: > > This patch shows the basic data flow necessary to implement efficient > > union mount file copyup (without the actual file copyup code). I need > > input from other VFS people on design, especially since namei.c is > > getting some much needed reorganization. Some background: > > > > In union mounts, a file on the lower, read-only file system layer is > > copied up to the top layer when an application attempts to write to > > it. This is in-kernel file copyup. Some constraints make this > > difficult: > > > > 1. Don't copyup if the operation would fail (e.g., open(O_WRONLY) on a > > file with mode 444). It's inefficient and a possible security hole to > > copy up a file if no write (or maybe even read) can occur anyway. > > The open fails in that case anyway so I see no reason to copy > anything. Why would you copy before you open? Basically, it's easiest to copy up the file at pathname lookup time, before you do all the permission checks. This constraint really says that you have to wait to do the copy up until you have finished any check that might fail. > On the other hand, when the open succeeds there is nothing stopping > the writes from happening save things like hardware failure or lack of > disk space. It's appropriate to create an empty inode in this case. > Did you consider creating the files as sparse and handling holes by > looking into the lower layer before /dev/zero? But then you would > perhaps need a flag that differentiates them from real sparse files. No, I haven't considered that. Currently, the design is to copy up the file at open() time to simplify the code. > Actually the file has to be copied even when it is open for reading > because if somebody writes it later the readonly bottom handle would > never receive the top updates. Copying up on every single open costs too much. Copy up on open-for-write does have this odd effect, but I consider it the moral equivalent of a process updating a file by copying it to a temporary file and then renaming it over the original. > > 2. Near-zero added cost for operations on non-union paths in a > > CONFIG_UNION_MOUNT kernel. One or two flag tests is okay, but > > grabbing a lock or doing a lookup operation to find out if a file > > should be copied up is too expensive. > > As I see it any file can be a part of union mount in two ways: > > 1) You are just looking it up so you do pay the lookup costs and you > will know where it is and if that is part of a union. Yes, we just have to record and track that information after the lookup. This will require a new flag or structure or structure member. user_path_at() is an example where we need to retain this information but throw it away and just return the path. > 2) it's already open so you have a handle to the file and this handle > can be tagged in some way. It is probably associated with the > originating filesystem anyway so you just need to look that way. In general, by the time you have an open file and you want to alter it, you've already copied it up. But I found two exceptions to this rule: fchown() and fchmod() are legal on an fd opened O_RDONLY. So we'll have to close, copyup, and re-open the file in that case. > > 3. A file should be copied up if the path (mnt + dentry) of its parent > > is from a union top layer (MNT_UNION set in mnt->mnt_flags). We can't > > tell if a file can be copied up by looking at its inode, dentry, or > > even path, we have to know what its parent path is. > > Let's see what the lookup may be like: > > 0) No lookup: The file is open. It's always top, no issue here. > > 1) Perform lookup: > - if the path goes through non-union directory structure nothing > interesting is going on > - if the path is currently being looked up inside an union mount it > can happen that a bottom directory exists without a corresponding top > directory. In this case the top directory has to be created. Otherwise > users who have the bottom directory open would not see new files added > (and removed) in the top layer. The current implementation always copies up directories on lookup. > - the fact that lookup is (or is not) going on inside an union mount > is a single boolean flag which may change only on mountpoint > boundaries. > > During the lookup it's easy to determine if sufficient permissions > exist to open a file. Directory permissions are recorded on the top > directories or are copied from the bottom directories to the top > directories created during lookup. Additionally files which are only > in the bottom layer can have permissions in there. If the file can be > opened then the file can be copied and must be copied on open. > > I guess that the union directories are a bit special in that they are > top directories but have continuation in the bottom directory. Still > this should not concern the user as they should not get the bottom > directory handle nor the top directory handle but a union directory > handle. > > On Wed, Mar 17, 2010 at 07:51:31AM +0900, J. R. Okajima wrote: > > > Valerie Aurora: > > > 1. Don't copyup if the operation would fail (e.g., open(O_WRONLY) on a > > > file with mode 444). It's inefficient and a possible security hole to > > > copy up a file if no write (or maybe even read) can occur anyway. > > > Just a question. > > How about this case? > > When the file is writable (0644 or something) but its parent directory > > is readonly (0555), do you think the file should be copied-up? > > It must appear to be copied up in one way or another. > If the mode can be changed and the FS code does not have some giant > lock that allows only one open handle at a time it cannot rely on the > directory not changing just because it does not appear to have any > write permissions at the time. > > Actually in a very primitive filesystem that locks the whole directory > it might hold that the directory cannot be changed while its mode says > it is readonly but it would still have to be quite non-reentrant for > that as the directory mode and contents should be quite separate. > > J. R. Okajima: > > Valerie Aurora: > > > I think what people will expect is that we copy up in that case. I > > > can think of ways this can go wrong, but perhaps that should be an > > > explicit requirement on the top-layer file system, that it can handle > > > create/unlink() in a directory without write permission. > > > I am not sure such requirement is the right way. > > How about delegating open() to keventd or some other workqueue which > > will succeed to create files under a directory without write permission? > > Of course, we should handle some error cases after creating a file. > > This will fail anyway when the filesystem is actually readonly. I > guess you need the equivalent of ramfs/tmpfs to store the directory > structure. Although if you plan on supporting more than two layers in > the future you may require the top layer to be writable and still > achieve reasonable flexibility. The current implementation stores the directory structure in the topmost directory itself using fallthrus as placeholders for lower level directory entries. Several people have written in-memory solutions, look for Miklos Szeredi, Bharata Rao, and Jan Blunck's patches. Thanks, -VAL -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html