Re: [PATCH 00/10] ext4: fix inconsistency since reading old metadata from disk

Jan Kara <jack@xxxxxxx> · Wed, 10 Jun 2020 11:57:39 +0200

On Wed 10-06-20 16:55:15, zhangyi (F) wrote:
> Hi, Jan.
> 
> On 2020/6/9 20:19, Jan Kara wrote> On Mon 08-06-20 22:39:31, zhangyi (F) wrote:
> >>> On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
> >>>> Background
> >>>> ==========
> >>>>
> >>>> This patch set point to fix the inconsistency problem which has been
> >>>> discussed and partial fixed in [1].
> >>>>
> >>>> Now, the problem is on the unstable storage which has a flaky transport
> >>>> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
> >>>> the bad network environment), if we failed to async write metadata in
> >>>> background, the end write routine in block layer will clear the buffer's
> >>>> uptodate flag, but the data in such buffer is actually uptodate. Finally
> >>>> we may read "old && inconsistent" metadata from the disk when we get the
> >>>> buffer later because not only the uptodate flag was cleared but also we
> >>>> do not check the write io error flag, or even worse the buffer has been
> >>>> freed due to memory presure.
> >>>>
> >>>> Fortunately, if the jbd2 do checkpoint after async IO error happens,
> >>>> the checkpoint routine will check the write_io_error flag and abort the
> >>>> the journal if detect IO error. And in the journal recover case, the
> >>>> recover code will invoke sync_blockdev() after recover complete, it will
> >>>> also detect IO error and refuse to mount the filesystem.
> >>>>
> >>>> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
> >>>> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
> >>>> containing valid data"), but it's not enough.
> >>>
> >>> Before we go and complicate ext4 code like this, I'd like to understand
> >>> what is the desired outcome which doesn't seem to be mentioned here, in the
> >>> commit 7963e5ac90125, or in the discussion you reference. If you have a
> >>> flaky transport that gives you IO errors, IMO it is not a bussiness of the
> >>> filesystem to try to fix that. I just has to make sure it properly reports
> >>
> >> If we meet some IO errors due to the flaky transport, IMO the desired outcome
> >> is 1) report IO error; 2) ext4 filesystem act as the "errors=xxx" configuration
> >> specified, if we set "errors=read-only || panic", we expect ext4 could remount
> >> to read-only or panic immediately to avoid inconsistency. In brief, the kernel
> >> should try best to guarantee the filesystem on disk is consistency, this will
> >> reduce fsck's work (AFAIK, the fsck cannot fix those inconsistent in auto mode
> >> for most cases caused by the async error problem I mentioned), so we could
> >> recover the fs automatically in next boot.
> > 
> > Good, so I fully agree with your goals. Let's now talk about how to achieve
> > them :)
> > 
> >> But now, in the case of metadata async writeback, (1) is done in
> >> end_buffer_async_write(), but (2) is not guaranteed, because ext4 cannot detect
> >> metadata write error, and it also cannot remount the filesystem or invoke panic
> >> immediately. Finally, if we read the metadata on disk and re-write again, it
> >> may lead to on-disk filesystem inconsistency.
> > 
> > Ah, I see. This was the important bit I was missing. And I think the
> > real problem here is that ext4 cannot detect metadata write error from
> > async writeback. So my plan would be to detect metadata write errors early
> > and abort the journal and do appropriate errors=xxx handling. And a
> > relatively simple way how to do that these days would be to use errseq in
> > the block device's mapping - sb->s_bdev->bd_inode->i_mapping->wb_err - that
> > gets incremented whenever there's writeback error in the block device
> > mapping so (probably in ext4_journal_check_start()) we could check whether
> > wb_err is different from the original value we sampled at mount time an if
> > yes, we know metadata writeback error has happened and we trigger the error
> > handling. What do you think?
> > 
> 
> Thanks a lot for your suggestion, this solution looks good to me. But Ithink
> add 'wb_err' checking into ext4_journal_check_start() maybe too early, see below
> race condition (It's just theoretical analysis, I'm not test it):
> 
> ext4_journal_start()
>  ext4_journal_check_start()  <-- pass checking
>                                  |   end_buffer_async_write()
>                                  |    mark_buffer_write_io_error() <-- set b_page
> sb_getblk(bh)   <-- read old data from disk
> ext4_journal_get_write_access(bh)
> modify this bh  <-- modify data and lead to inconsistency
> ext4_handle_dirty_metadata(bh)
> 
> So I guess it may still lead to inconsistency. How about add this checking
> into ext4_journal_get_write_access() ?

Yes, this also occured to me later. Adding the check to
ext4_journal_get_write_access() should be safer.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR