Re: [PATCH] ext4: Fix entry corruption when disk online and offline frequently

"Theodore Ts'o" <tytso@xxxxxxx> · Fri, 17 May 2019 18:59:40 -0400

On Tue, May 14, 2019 at 12:23:37PM +0800, ZhangXiaoxu wrote:
> I got some errors when I repair an ext4 volume which stacked by an
> iscsi target:
>     Entry 'test60' in / (2) has deleted/unused inode 73750.  Clear?
> It can be reproduced when the network not good enough.
> 
> When I debug this I found ext4 will read entry buffer from disk and
> the buffer is marked with write_io_error.
> 
> If the buffer is marked with write_io_error, it means it already
> wroten to journal, and not checked out to disk. IOW, the journal
> is newer than the data in disk.
> If this journal record 'delete test60', it means the 'test60' still
> on the disk metadata.
> 
> In this case, if we read the buffer from disk successfully and create
> file continue, the new journal record will overwrite the journal
> which record 'delete test60', then the entry corruptioned.
> 
> So, use the buffer rather than read from disk if the buffer marked
> with write_io_error

You've raised a number of issues about how we handle write errors,
especially when they occur due to a flaky transport --- in your case,
due to iSCSI.  As such, your patch isn't wrong, so much as it is
incomplete.

For example, your reasoning that it is safe to clear write_io_error
and reset buffer_uptodate whenever the buffer is marked write_io_error
holds only if journalling is enabled.  If the file system does not
have a journal, there is no journal to fall back upon.  For file
systems which do have a journal, if you are using a
flaky iSCSI transport, there is no protection from write errors which
occur when the journal is replayed.  (fs/jbd2/recovery.c simply marks
the buffer dirty and allows the writeback code to take care of writing
the buffer.)  This means that the buffer could have write_io_error set
due to a failure to write the buffer during recovery, in which case
relying on the journal having an up-to-date copy of the block is invalid.

Also, this patch only patches the ext4_bread() path, which is only used
by directories.  It doesn't deal with metadata reads for allocation
bitmaps or extent tree blocks.  We are doing this hack for inode table
blocks, already; perhaps you got the idea to do this for ext4_bread()
from __ext4_get_inode_loc()?
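
To make the shape of that hack concrete, here is a minimal userspace
sketch.  It is only a model, not the kernel code: struct model_bh and
model_need_disk_read() are hypothetical names standing in for the
buffer_head state bits and for the check made before issuing a read,
and whether the error bit should also be cleared is deliberately left
out of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two buffer state bits discussed above. */
struct model_bh {
	bool uptodate;		/* in-memory contents are considered valid */
	bool write_io_error;	/* an earlier writeback of this buffer failed */
};

/*
 * Returns true if the caller would have to read the block from disk.
 *
 * If an earlier write failed, the in-memory copy may be newer than
 * what is on disk, so the model treats it as valid again instead of
 * rereading possibly stale data.
 */
bool model_need_disk_read(struct model_bh *bh)
{
	assert(bh != NULL);

	if (bh->write_io_error && !bh->uptodate)
		bh->uptodate = true;

	return !bh->uptodate;
}
```

Note that the model only helps while the buffer is still in memory,
and it says nothing about whether the in-memory copy is actually
backed by anything recoverable.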

We could add some kind of callback from the buffer cache layer when an
asynchronous writeback fails --- or we could use a synchronous write
in the journal recovery code (which would be bad from a performance
perspective, but ignore that for the moment) --- however, what do we
do when we discover that there is an error?  Right now, we do nothing
until we try to read the inode table block (and after your patch,
reading a directory block).  Under memory pressure, though, the data
will get lost and we don't even mark the file system as needing to be
checked.  We could retry the write, but if it's due to a flaky iSCSI
or FC transport, this write could fail yet again --- and then what?

So while I could apply this patch, since it doesn't make things worse,
I want to make sure you are aware that if you have problems with your
iSCSI device, this patch is far from a complete solution.  At the very
least, we should handle reads for other metadata blocks.

      	      	   	    	       		  - Ted