On Wed, Oct 29, 2008 at 02:16:45PM +1100, Dave Chinner wrote:
> On Wed, Oct 29, 2008 at 01:16:53AM +0100, Nick Piggin wrote:
> > On Wed, Oct 29, 2008 at 09:27:46AM +1100, Dave Chinner wrote:
> > > On Tue, Oct 28, 2008 at 04:39:53PM +0100, Nick Piggin wrote:
> > > Yes - that's coming from end_buffer_async_write() when an error is
> > > reported in bio completion. This does:
> > >
> > > 465         set_bit(AS_EIO, &page->mapping->flags);
> > > 466         set_buffer_write_io_error(bh);
> > > 467         clear_buffer_uptodate(bh);
> > > 468         SetPageError(page);
> > >
> > > Hmmmm - do_fsync() calls filemap_fdatawait() which ends up in
> > > wait_on_page_writeback_range(), which appears to be checking the
> > > mapping flags for errors. I wonder why that error is not being
> > > propagated then? AFAICT both XFS and the fsync code are doing the
> > > right thing, but somewhere the error has gone missing...
> >
> > This one-liner has it reporting EIO errors like a champion. I
> > don't know if you'll actually need to put this into the
> > linux API layer or not, but anyway the root cause of the problem
> > AFAIKS is this.
> > --
> >
> > XFS: fix fsync errors not being propagated back to userspace.
> > ---
> > Index: linux-2.6/fs/xfs/xfs_vnodeops.c
> > ===================================================================
> > --- linux-2.6.orig/fs/xfs/xfs_vnodeops.c
> > +++ linux-2.6/fs/xfs/xfs_vnodeops.c
> > @@ -715,7 +715,7 @@ xfs_fsync(
> >  	/* capture size updates in I/O completion before writing the inode. */
> >  	error = filemap_fdatawait(VFS_I(ip)->i_mapping);
> >  	if (error)
> > -		return XFS_ERROR(error);
> > +		return XFS_ERROR(-error);
>
> <groan>
>
> Yeah, that'd do it. Good catch. I can't believe I recently fixed a
> bug that touched these lines of code without noticing the inversion.
> Sometimes I wonder if we should just convert the entirety of XFS to
> return negative errors - mistakes in handling negative error numbers
> in the core XFS code happen all the time.
Heh, yeah it seems a bit tricky. BTW, feel free to add

Signed-off-by: Nick Piggin <npiggin@xxxxxxx>

if you would like to merge it, or alternatively, if you fix it another way, I can rerun the test in 2 minutes if it is not 100% obviously correct.

BTW, with this change in place XFS seems to be a lot more consistent than ext3 (ordered) in reporting EIO. I don't know why exactly, but in ext3 I can see hundreds or thousands of failures injected in this manner:

FAULT_INJECTION: forcing a failure
Call Trace:
9ebe17f8:  [<6019f34e>] random32+0xe/0x20
9ebe1808:  [<601a3329>] should_fail+0xd9/0x130
9ebe1838:  [<6018d234>] generic_make_request+0x304/0x4e0
9ebe1848:  [<600622e1>] mempool_alloc+0x51/0x130
9ebe18f8:  [<6018e82f>] submit_bio+0x4f/0xe0
9ebe1948:  [<600a8969>] submit_bh+0xd9/0x130
9ebe1978:  [<600aab08>] __block_write_full_page+0x198/0x310
9ebe1988:  [<600a8487>] alloc_buffer_head+0x37/0x50
9ebe19a0:  [<600f6ec0>] ext3_get_block+0x0/0x110
9ebe19d8:  [<600f6ec0>] ext3_get_block+0x0/0x110
9ebe19e8:  [<600aad48>] block_write_full_page+0xc8/0x100
9ebe1a38:  [<600f8951>] ext3_ordered_writepage+0xd1/0x180
9ebe1a48:  [<6006616b>] set_page_dirty+0x4b/0xd0
9ebe1a88:  [<600660f5>] __writepage+0x15/0x40
9ebe1aa8:  [<60066755>] write_cache_pages+0x255/0x470
9ebe1ac0:  [<600660e0>] __writepage+0x0/0x40
9ebe1bc8:  [<60066990>] generic_writepages+0x20/0x30
9ebe1bd8:  [<600669db>] do_writepages+0x3b/0x40
9ebe1bf8:  [<6006006c>] __filemap_fdatawrite_range+0x5c/0x70
9ebe1c58:  [<6006028a>] filemap_fdatawrite+0x1a/0x20
9ebe1c68:  [<600a7a15>] do_fsync+0x45/0xe0
9ebe1c98:  [<6007792b>] sys_msync+0x14b/0x1d0
9ebe1cf8:  [<60019a70>] handle_syscall+0x50/0x80
9ebe1d18:  [<6002a10f>] userspace+0x44f/0x510
9ebe1fc8:  [<60016792>] fork_handler+0x62/0x70
Buffer I/O error on device ram0, logical block 8228
lost page write due to I/O error on ram0

But eventually the filesystem aborts when we get a journal writing error, and sometimes I see a single EIO to msync, sometimes not (cc ext3 list).
Whereas XFS reports a lot more EIOs as it goes... If the ext3 guys have any suggestions, I could test them out.

> FWIW, the core issue here is that we've got to do the
> filemap_fdatawait() call in the ->fsync method because ->fsync
> gets called before we've waited for the data I/O to complete.
> XFS updates inode state on I/O completion, so we *must* wait
> for data I/O to complete before logging the inode changes. I
> think btrfs has the same problem....

Interesting. Does that mean you can do without the final filemap_fdatawait()? Can you do the first fdatawait without i_mutex held? There was some talk, IIRC, about moving all this logic into the filesystem to provide more flexibility here. If there is still enough interest, we should get that moving...

> Thanks again, Nick.

No problem.

Thanks,
Nick
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html