(Added Matthew Wilcox to the CC list) On Fri, Jun 15, 2018 at 11:45:46AM -0400, Eric Whitney wrote: > As discussed in this week's ext4 concall, I'm seeing a regression for > generic/347 in the data=journal test case when running xfstests-bld on > 4.17. The test fails with the following output: > > QA output created by 347 > +/sbin/e2fsck: Unknown code ____ 251 while recovering journal of /dev/mapper/thin-vol > > The failure bisects to: > > errseq: Always report a writeback error once (b4678df184b3) > > This patch landed in v4.17-rc4. When it's reverted in v4.17, it's no longer > possible to reproduce the error. The bogus error was casued by a bug in e2fsck/journal.c. Once I fixed sync_blockdev() to return the error using the kernel error return conventions (we convert it back to an errcode_t higher up in the call stack), we get the expected error: /sbin/e2fsck: Input/output error while recovering journal of /dev/mapper/thin-vol The problem is I'm not sure how to fix this in the kernel. In order to keep the postgress folks, we can't just clear wb_err when the dm_thin volume is grown, since there were previous errors. Unlike the loop device case, the dm_thin device isn't getting reset in the test generic/347. More space is added to the dm_thin device so that future writes won't fail, but the previous writes definitely will fail, so there's no obvious fix to dm-thin that won't break Postgres's "interesting" error collection strategy. So I think all we can do is have e2fsck deliberately throw away the error by calling fsync on the block device. The unfortunate thing about this is that this forces a SYNCHRONIZE CACHE command to be sent to the storage device, which is otherwise unnecessary. I'll do this for now (see below), since there appears to be no other alternative, but I wonder if the right answer is we should add a new fnctl() which allows the caller to collect the error and clear wb_err without needing to send CACHE FLUSH to the device. Willy, what do you think? - Ted --- a/lib/ext2fs/unix_io.c +++ b/lib/ext2fs/unix_io.c @@ -569,6 +569,14 @@ static errcode_t unix_open_channel(const char *name, int fd, if (safe_getenv("UNIX_IO_FORCE_BOUNCE")) flags |= IO_FLAG_FORCE_BOUNCE; +#ifdef __linux__ + /* + * We need to make sure any previous errors in the block + * device are thrown away, sigh. Scat, bogus error! Scat!! + */ + (void) fsync(fd); +#endif + retval = ext2fs_get_mem(sizeof(struct struct_io_channel), &io); if (retval) goto cleanup;