[Bug 200753] write I/O error for inode structure leads to operation failure without any warning or error

https://bugzilla.kernel.org/show_bug.cgi?id=200753

--- Comment #11 from Theodore Tso (tytso@xxxxxxx) ---
On Tue, Aug 07, 2018 at 03:36:25AM +0000, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx
wrote:
> Now, I probably would expect to see some errors in dmesg if for example
> inode flushing fails at unmount time, though.  It's not strictly a bug
> to not log an error at the filesystem level, but it's probably
> desirable.  Ted can probably speak to this better than I can.

In practice we do, but it's coming from lower levels of the storage
stack.  We'll of course log errors writing into the journal, but once
the metadata updates are logged to the journal, they get written back
by the buffer cache writeback functions.  Immediately before the
unmount they will be flushed out by the jbd2 layer in
fs/jbd2/checkpoint.c, using fs/buffer.c's write_dirty_buffer() with a
REQ_SYNC flag, and I/O errors get logged via buffer_io_error().
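
Roughly, that generic path looks like the following --- paraphrased
from fs/buffer.c, so treat it as a sketch rather than a quote of any
particular kernel version:

void write_dirty_buffer(struct buffer_head *bh, int op_flags)
{
	lock_buffer(bh);
	if (!test_clear_buffer_dirty(bh)) {
		unlock_buffer(bh);
		return;
	}
	/* Generic completion handler; it has no idea what the buffer
	 * contains, only which device and block number it covers. */
	bh->b_end_io = end_buffer_write_sync;
	get_bh(bh);
	submit_bh(REQ_OP_WRITE, op_flags, bh);
}

void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
{
	if (uptodate) {
		set_buffer_uptodate(bh);
	} else {
		buffer_io_error(bh, ", lost sync page write");
		mark_buffer_write_io_error(bh);
		clear_buffer_uptodate(bh);
	}
	unlock_buffer(bh);
	put_bh(bh);
}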

In practice, the media write errors are logged by the device drivers,
so the users have some idea that Bad Stuff is happening --- assuming
they even look at dmesg at all, of course.

One could imagine an enhancement to the file system that teaches it
not to use the generic buffer cache writeback functions, and instead
to submit metadata I/O requests with custom completion callbacks, so
that in case of an error there would be an ext4-level message
explaining that writeback to inode table block XXX, affecting inodes
YYY-ZZZ, failed.  And if someone submitted such a patch, I'd consider
it for inclusion, assuming the patch was clean, correct, and didn't
introduce long-term maintenance burdens.
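
Purely as an illustration, something like the sketch below --- the
function name, the message, and the way the inode range is carried
around are all made up here, and a real patch would need somewhere
other than bh->b_private (jbd2 already uses that) to hang its context:

/* Hypothetical context set up by the code submitting the write. */
struct ext4_itable_write_ctx {
	unsigned int	first_ino;
	unsigned int	last_ino;
};

static void ext4_end_itable_write(struct buffer_head *bh, int uptodate)
{
	struct ext4_itable_write_ctx *ctx = bh->b_private;

	if (!uptodate)
		printk(KERN_ERR "EXT4-fs: writeback of inode table block %llu failed, inodes %u-%u affected\n",
		       (unsigned long long) bh->b_blocknr,
		       ctx->first_ino, ctx->last_ino);
	kfree(ctx);
	unlock_buffer(bh);
	put_bh(bh);
}

The submission side would then set bh->b_end_io to that callback and
call submit_bh() itself (whose arguments vary by kernel version)
instead of going through the generic writeback path.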

However, at $WORK, we have a custom set of changes so that all file
system errors as well as media errors from the SCSI/SATA layer get
sent to a userspace daemon via netlink, and as necessary, the disk
will be automatically sent to a repair workflow.  The repair workflow
would then tell the cluster file system to stop using that disk, and
then either confirm that the bad block redirection pool was able to
kick in correctly, or flag the drive to be sent to a hardware
operations team to replace the disk drive, etc.
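
(To give a flavor of the userspace side only: the sketch below is
purely hypothetical --- the protocol number and message format are
stand-ins, since the out-of-tree patch isn't public.)

#include <linux/netlink.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

#define NETLINK_FS_ERRORS 31	/* made-up protocol number */

int main(void)
{
	struct sockaddr_nl sa = { .nl_family = AF_NETLINK, .nl_groups = 1 };
	char buf[4096];
	int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_FS_ERRORS);

	if (fd < 0 || bind(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
		perror("netlink");
		return 1;
	}
	for (;;) {
		ssize_t n = recv(fd, buf, sizeof(buf), 0);

		if (n <= (ssize_t) NLMSG_HDRLEN)
			continue;
		/* Assume the payload is a text description of the
		 * error; hand it to the repair workflow from here. */
		printf("storage error event: %.*s\n",
		       (int) (n - NLMSG_HDRLEN), buf + NLMSG_HDRLEN);
	}
}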

(The main reason why we haven't sent it upstream is that the patch as
it stands today is a bit of an ugly kludge, and would have to be
rewritten as a better-structured kernel->userspace error reporting
mechanism --- either for the storage stack in general, or for the
whole kernel.  Alas, no one has had the time or energy to deal with
the almost certain bike-shedding party that would ensue after
proposing such a new feature.  :-) )

So I don't have much motivation to fix up something to log explanatory
error messages from the file system level, when the device driver
errors in dmesg are in practice quite sufficient for most users.  Even
without a custom netlink patch, you can just scrape dmesg.  An example
of such a userspace approach can be found here:

        https://github.com/kubernetes/node-problem-detector
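
The scraping approach itself is nothing fancy; the whole trick is just
watching the kernel log for the block layer / buffer cache error
strings.  A minimal C sketch, with the match string as an example
rather than an exhaustive list:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char rec[8192];
	int fd = open("/dev/kmsg", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/kmsg");
		return 1;
	}
	for (;;) {
		/* Each read() returns one kernel log record. */
		ssize_t n = read(fd, rec, sizeof(rec) - 1);

		if (n <= 0)
			continue;
		rec[n] = '\0';
		if (strstr(rec, "I/O error"))
			fprintf(stderr, "possible disk problem: %s", rec);
	}
}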

In practice, most systems don't need to know exactly which file
system metadata block ran into I/O problems.  We just need to know
when the disk drive has started developing errors, and when it has,
whether we can restore the disk drive to being a 100% functional
storage device, so it can be safely and sanely used by the file system
layer --- or whether it's time to replace it with a working drive.

I do agree with Eric that it would be a nice-to-have if ext4 logged a
message when an inode table block or an allocation bitmap block runs
into errors while being flushed out, at unmount time or by the
kernel's writeback threads.  But it's really only a nice-to-have.

Patches gratefully accepted....

-- 
You are receiving this mail because:
You are watching the assignee of the bug.


