On Tue, May 3, 2011 at 1:20 AM, Jan Kara <jack@xxxxxxx> wrote: > On Mon 02-05-11 14:12:30, Andrew Morton wrote: >> On Mon, 2 May 2011 22:56:56 +0200 >> Jan Kara <jack@xxxxxxx> wrote: >> >> > So far, ext3 was allocating necessary blocks for mmapped writes when >> > writepage() was called. There are several issues with this. The worst >> > being that user is allowed to arbitrarily exceed disk quotas because >> > writepage() is called from flusher thread context (which is root) and thus >> > quota limits are ignored. Another bad consequence is that data is just lost >> > if we find there's no space on the filesystem during ->writepage() time. >> > >> > We solve these issues by implementing block reservation in page_mkwrite() >> > callback. We don't want to really allocate blocks on page_mkwrite() time >> > because for random writes via mmap (as seen for example with applications using >> > BerkeleyDB) it results in much more fragmented files and thus much worse >> > performance. So we allocate indirect blocks and reserve space for data block in >> > page_mkwrite() and do the allocation of data block from writepage(). >> >> Yes, instantiating the metadata and accounting the data is a good >> approach. The file layout will be a bit suboptimal, but surely that >> will be a minor thing. >> >> But boy, it's a complicated patch! Are we really sure that we want to >> make changes this extensive to our antiquated old fs? Or do we just >> say "yeah, it's broken with quotas - use ext4"? > The patch isn't trivial, I agree (although it's mostly straightforward). > Regarding telling users to switch to ext4 - it seems a bit harsh to me > to ask people to switch to ext4 as a response to a (possibly security) > issue they uncover. Because for most admins switching to ext4 will require > some non-trivial testing I presume. Of course, the counterweight is the > possibility of new bugs introduced to the code by my patch. But after some > considerations I've decided it's worth it and and fixed the bug... Jan, Maybe you can work out a simpler(=safer) patch, which uses much larger reservation margins. For example, by reserving on mmap call, the difference between i_size and i_blocks and storing that reservation in in-memory inode, protected by i_mutex. If an in-memory inode RESERVATION flag is set, you can update i_reserved_blocks when updating i_blocks (alloc and free) and i_size (truncate). global in-memory reservation counters can be updated at the same time. This approach will avoid the need to introduce delayed allocation to ext3, since you got the reservation on mmap time, you can rest assure that no dirty data will be dropped on the floor and no quota limits will be exceeded. The trade off is that an application that want to mmap a very large sparse file to write just some blocks will not be able to do that on a near full file system. Do you know of any such real application? Amir. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html