On Mon, Jun 16, 2008 at 05:05:33PM +0200, Jan Kara wrote: > Hi Aneesh, > > First, I'd like to see some short comment on what semantics > delalloc,data=ordered is going to have. At least I can imagine at least > two sensible approaches: > 1) All we guarantee is that user is not going to see uninitialized data. > We send writes to disk (and allocate blocks) whenever it fits our needs > (usually when pdflush finds them). > 2) We guarantee that when transaction commits, your data is on disk - > i.e., we allocate actual blocks on transaction commit. > > Both these possibilities have their pros and cons. Most importantly, > 1) gives better disk layout while 2) gives higher consistency > guarantees. Note that with 1), it can under some circumstances happen, > that after a crash you see block 1 and 3 of your 3-block-write on disk, > while block 2 is still a hole. 1) is easy to implement (you mostly did > it below), 2) is harder. I think there should be broader consensus on > what the semantics should be (changed subject to catch more attention > ;). > > A few comments to your patch are also below. > > Honza The way I was looking at ordered mode was, we only guarantee that the meta-data blocks corresponding to the data block allocated get committed only after the data-blocks are written to the disk. As long as we don't allocate blocks corresponding to a page we don't write the page to disk. This should also speed up the "sync slowness" that lot of people are reporting with ordered mode. Can you explain " 1), it can under some circumstances happen, that after a crash you see block 1 and 3 of your 3-block-write on disk, while block 2 is still a hole. " > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx> > > --- > > fs/ext4/inode.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++++++-- > > fs/jbd2/commit.c | 41 ++++++++++++-- > > 2 files changed, 198 insertions(+), 12 deletions(-) > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > index 63355ab..7d87641 100644 > > --- a/fs/ext4/inode.c > > +++ b/fs/ext4/inode.c > > @@ -1606,13 +1606,12 @@ static int ext4_bh_unmapped_or_delay(handle_t *handle, struct buffer_head *bh) > > return !buffer_mapped(bh) || buffer_delay(bh); > > } > > > > -/* FIXME!! only support data=writeback mode */ > > /* > > * get called vi ext4_da_writepages after taking page lock > > * We may end up doing block allocation here in case > > * mpage_da_map_blocks failed to allocate blocks. > > */ > > -static int ext4_da_writepage(struct page *page, > > +static int ext4_da_writeback_writepage(struct page *page, > > struct writeback_control *wbc) > > { > > int ret = 0; > > @@ -1660,6 +1659,61 @@ static int ext4_da_writepage(struct page *page, > > return ret; > > } > > > > +/* > > + * get called vi ext4_da_writepages after taking page lock > > + * We may end up doing block allocation here in case > > + * mpage_da_map_blocks failed to allocate blocks. > > + * > > + * We also get called via journal_submit_inode_data_buffers > > + */ > > +static int ext4_da_ordered_writepage(struct page *page, > > + struct writeback_control *wbc) > > +{ > > + int ret = 0; > > + loff_t size; > > + unsigned long len; > > + handle_t *handle = NULL; > > + struct buffer_head *page_bufs; > > + struct inode *inode = page->mapping->host; > > + > > + handle = ext4_journal_current_handle(); > > + if (!handle) { > > + /* > > + * This can happen when we aren't called via > > + * ext4_da_writepages() but directly (shrink_page_list). > > + * We cannot easily start a transaction here so we just skip > > + * writing the page in case we would have to do so. > > + */ > > + size = i_size_read(inode); > > + > > + page_bufs = page_buffers(page); > > + if (page->index == size >> PAGE_CACHE_SHIFT) > > + len = size & ~PAGE_CACHE_MASK; > > + else > > + len = PAGE_CACHE_SIZE; > > + > > + if (walk_page_buffers(NULL, page_bufs, 0, > > + len, NULL, ext4_bh_unmapped_or_delay)) { > > + /* > > + * We can't do block allocation under > > + * page lock without a handle . So redirty > > + * the page and return. > > + * We may reach here when we do a journal commit > > + * via journal_submit_inode_data_buffers. > > + * If we don't have mapping block we just ignore > > + * them > > + * > > + */ > > + redirty_page_for_writepage(wbc, page); > > + unlock_page(page); > > + return 0; > > + } > > + } > > + > > + ret = block_write_full_page(page, ext4_da_get_block_write, wbc); > > + > > + return ret; > > +} > If you're going to use this writepage() implementation from commit > code, you cannot simply do redirty_page_for_writepage() and bail out > when there's an unmapped buffer. You must write out at least mapped > buffers to satisfy ordering guarantees (think of filesystems with > blocksize < page size). With delalloc is it possible to have a page that have some buffer_heads marked delay ? -aneesh -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html