On Thu 26-03-09 14:17:23, Aneesh Kumar K.V wrote:
> On Tue, Mar 24, 2009 at 04:29:59PM +0100, Jan Kara wrote:
> > On Tue 24-03-09 15:56:03, Peter Zijlstra wrote:
> > > On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote:
> > > >
> > > > Or we could implement ext3_mkwrite() to allocate buffers already when we
> > > > make page writeable. But it costs some performance (we have to write page
> > > > full of zeros when allocating those buffers, where previously we didn't
> > > > have to do anything) and it's not trivial to make it work if pagesize >
> > > > blocksize (we should not allocate buffers outside of i_size so if i_size
> > > > = 1024, we create just one block in ext3_mkwrite() but then we need to
> > > > allocate more when we extend the file).
> > >
> > > I think this is the best option, failing with SIGBUS when we fail to
> > > allocate blocks seems consistent with other filesystems as well.
> >   I agree this looks attractive at first sight. But there are drawbacks
> > as I wrote - the problem with blocksize < pagesize, slight performance
> > decrease due to additional write,
>
> It should not cause an additional write. Can you let me know why it would
> result in additional write ?
  Because if you have a new page, at the time mkwrite() or set_page_dirty()
is called, it is just full of zeros. So we attach buffers full of zeros to
the running transaction to satisfy the data=ordered mode requirements. Then
these get written out on transaction commit (or they can already contain
some data the user has written via mmap) but we're going to write them again
when writepage() is called on the page.
  Umm, but yes, thinking more about the details, we clear buffer dirty bits
at commit time so if by that time the user has copied in all the data, the
subsequent writepage() will find that all the buffers are clean and will not
send them to disk. So in this case the overhead is going to be just
journal_start() + journal_stop().
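
To make the write counting above concrete, here is a toy userspace model in
C. It is only a sketch: struct ord_buffer and the ord_* helpers are invented
for illustration and are not kernel interfaces, and it assumes a single
buffer whose dirty bit is cleared both by commit and by writepage.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one buffer; the fields are illustrative, not kernel structures. */
struct ord_buffer {
    bool dirty;          /* buffer holds data that is not on disk yet */
    bool in_transaction; /* attached to the running transaction (data=ordered) */
    int disk_writes;     /* how many times the buffer was sent to disk */
};

/* Allocation at mkwrite time: the zero-filled buffer joins the transaction. */
static void ord_mkwrite(struct ord_buffer *b)
{
    b->dirty = true;
    b->in_transaction = true;
}

/* The user stores to the mmapped page. */
static void ord_user_store(struct ord_buffer *b)
{
    b->dirty = true;
}

/* Commit writes out ordered data and clears the dirty bit. */
static void ord_commit(struct ord_buffer *b)
{
    if (b->in_transaction && b->dirty) {
        b->disk_writes++;
        b->dirty = false;
    }
    b->in_transaction = false;
}

/* writepage only submits buffers that are still dirty. */
static void ord_writepage(struct ord_buffer *b)
{
    if (b->dirty) {
        b->disk_writes++;
        b->dirty = false;
    }
}

/* The user fills the page before the commit runs. */
static int ord_writes_when_data_precedes_commit(void)
{
    struct ord_buffer b = { false, false, 0 };

    ord_mkwrite(&b);
    ord_user_store(&b);
    ord_commit(&b);
    assert(!b.dirty); /* commit left the buffer clean */
    ord_writepage(&b);
    return b.disk_writes;
}

/* The commit writes zeros first, the user dirties the page afterwards. */
static int ord_writes_when_data_follows_commit(void)
{
    struct ord_buffer b = { false, false, 0 };

    ord_mkwrite(&b);
    ord_commit(&b);
    ord_user_store(&b);
    ord_writepage(&b);
    return b.disk_writes;
}
```

In this model ord_writes_when_data_precedes_commit() returns 1 and
ord_writes_when_data_follows_commit() returns 2, so the extra write appears
only when the page is dirtied again after the commit.
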
OTOH mm usually decides to write the page only after some time, so if the
user writes to the page often then we really do one more write. But in this
case the one additional write is probably going to be lost in the total
number of writes of the page. So yes, this is not such a big issue as I
thought originally.

> >page faults doing allocation can take a
> > *long* time
>
> That is true
>
> >and overall fragmentation is going to be higher (previously
> > writepage wrote pages for us in the right order, now we are going to
> > allocate in the first-accessed order). So I'm not sure we really want to
> > go this way.
>
>
> block allocator should be improved to fix that. For example ext4
> mballoc also looks at the logical file block number when doing block
> allocation. So if we do enough reservation it should handle the
> first-accessed order and sequential order allocation properly.
  Well, we could definitely improve the ext3 allocator. But do we really
want to backport mballoc to ext3? IMO it is easier to essentially perform
delayed allocation at the time of mkwrite() and then do the real allocation
at the time of writepage(). So I'd rather vote for the mechanism I write
about below.

> Another reason why I think we would need ext3_page_mkwrite is, if we
> really are out of space how do we handle it ? Currently the patch you
> posted does redirty_page_for_writepage, which would imply we can't
> reclaim the page and since get_block gets ENOSPC we can't allocate
> blocks.
  I definitely agree we should somehow solve this problem but the mechanism
below seems to be an easier way to me.

> >   Hmm, maybe we could play a trick ala delayed allocation - i.e., reserve
> > some space in mkwrite() but don't actually allocate it. That would be done
> > in writepage(). This would solve all the problems I describe above. We could
> > use PG_Checked flag to track that the page has a reservation and behave
> > accordingly in writepage() / invalidatepage().
> > ext3 in data=journal mode already uses the flag but the use seems to be
> > compatible with what I want to do now... So it may actually work.
> >   BTW: Note that there are plenty of filesystems that don't implement
> > mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with
> > ENOSPC. So I'd not speak too much about consistency ;).

								Honza
-- 
Jan Kara <jack@xxxxxxx>
SUSE Labs, CR
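
A rough userspace sketch of the reserve-in-mkwrite, allocate-in-writepage
idea quoted above. Everything here is hypothetical: struct rsv_fs, struct
rsv_page and the rsv_* functions are made-up names, the has_reservation
field merely stands in for the PG_Checked flag, and the model assumes one
block per page (pagesize == blocksize).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical filesystem-wide counters, not real ext3 structures. */
struct rsv_fs {
    long free_blocks;     /* blocks not yet allocated */
    long reserved_blocks; /* promised to pages, still counted as free */
};

/* Hypothetical per-page state. */
struct rsv_page {
    bool has_reservation; /* stand-in for the PG_Checked flag */
    bool allocated;       /* a real block backs this page */
};

/* mkwrite: only reserve space; fail early when nothing is left. */
static int rsv_page_mkwrite(struct rsv_fs *fs, struct rsv_page *pg)
{
    assert(fs->reserved_blocks >= 0 &&
           fs->reserved_blocks <= fs->free_blocks);

    if (pg->allocated || pg->has_reservation)
        return 0;
    if (fs->free_blocks - fs->reserved_blocks < 1)
        return -ENOSPC; /* the fault path would turn this into SIGBUS */
    fs->reserved_blocks++;
    pg->has_reservation = true;
    return 0;
}

/* writepage: turn the reservation into a real allocation. */
static int rsv_writepage(struct rsv_fs *fs, struct rsv_page *pg)
{
    if (pg->allocated)
        return 0;
    if (pg->has_reservation) {
        fs->reserved_blocks--;
        pg->has_reservation = false;
    } else if (fs->free_blocks - fs->reserved_blocks < 1) {
        /* page was dirtied without a reservation and no space is left */
        return -ENOSPC;
    }
    fs->free_blocks--;
    pg->allocated = true;
    return 0;
}

/* invalidatepage: give back a reservation that was never used. */
static void rsv_invalidatepage(struct rsv_fs *fs, struct rsv_page *pg)
{
    if (pg->has_reservation) {
        fs->reserved_blocks--;
        pg->has_reservation = false;
    }
}
```

The point of the split is that ENOSPC is reported at fault time, while the
block itself is still chosen at writepage time, so the allocation order is
not forced to follow the first-accessed order.
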