I've applied this patch with the modified patch summary: ext4: treat buffers with write errors as containing valid data Cheers, - Ted On Fri, May 17, 2019 at 06:59:40PM -0400, Theodore Ts'o wrote: > On Tue, May 14, 2019 at 12:23:37PM +0800, ZhangXiaoxu wrote: > > I got some errors when I repair an ext4 volume which stacked by an > > iscsi target: > > Entry 'test60' in / (2) has deleted/unused inode 73750. Clear? > > It can be reproduced when the network not good enough. > > > > When I debug this I found ext4 will read entry buffer from disk and > > the buffer is marked with write_io_error. > > > > If the buffer is marked with write_io_error, it means it already > > wroten to journal, and not checked out to disk. IOW, the journal > > is newer than the data in disk. > > If this journal record 'delete test60', it means the 'test60' still > > on the disk metadata. > > > > In this case, if we read the buffer from disk successfully and create > > file continue, the new journal record will overwrite the journal > > which record 'delete test60', then the entry corruptioned. > > > > So, use the buffer rather than read from disk if the buffer marked > > with write_io_error > > You've raised a number of issues about how we handle write errors, > especially when they occur due to a flaky transport --- in your case, > due to iSCSI. As such, your patch isn't wrong, so much as it is > incomplete. > > For example, your assumption that if the buffer is marked > write_io_error, it's safe to clear write_io_error and reset > buffer_uptodate assumes that journalling is enabled. If the file > system does not have the journal, there is no journal to fall back > upon. For file systems which do have a journal, if you are using a > flaky iSCSI transport, there is no protection from write errors which > occur when the journal is replayed. (fs/jbd2/recovery.c simply marks > the buffer dirty and allows the writeback code take care of writing > the buffer.) This means that the buffer could have write_io_error set > due to a failure to write the buffer during recovery, in which case > relying on the journal having a uptodate copy block is invalid. > > Also, this patch only patches the ex4_bread() path, which is only used > by directories. It doesn't deal with metadata reads for allocation > bitmaps or extent tree blocks. We are doing this hack for inode table > blocks, already; perhaps you got the idea to do this for ext4_bread() > from __ext4_get_inode_loc()? > > We could add some kind of callback from the buffer cache layer when an > aysnchronous writeback fails --- or we could use a synchronous write > in the journal recovery code (which would be bad from a performance > perspective, but ignore that for the moment) --- however, what do we > do when we discover that there is an error? Right now, we do nothing > until we try to read the inode table block (and after your patch, > reading a directory block). Under memory pressure, though, the data > will get lost and we don't even mark the file system as needing to be > checked. We could retry the write, but if it's due to a flaky iSCSI > or FC transport, this write could fail yet again --- and then what? > > So while I could apply this patch, since it doesn't make things worse, > I want to make sure you are aware that if you have problems with your > iSCSI device, this patch is far from a complete solution. At the very > least, we should handle reads for other metadata block. > > - Ted