Re: Question: errors=continue behaviour for failed external journal device

Lukáš Czerner <lczerner@xxxxxxxxxx> · Mon, 28 Jul 2014 15:25:52 +0200 (CEST)

On Mon, 28 Jul 2014, Theodore Ts'o wrote:

> Date: Mon, 28 Jul 2014 09:17:42 -0400
> From: Theodore Ts'o <tytso@xxxxxxx>
> To: Lukáš Czerner <lczerner@xxxxxxxxxx>
> Cc: Vlad Dobrotescu <vlad@xxxxxxxxxxxxx>, linux-ext4@xxxxxxxxxxxxxxx
> Subject: Re: Question: errors=continue behaviour for failed external journal
>     device
> 
> On Mon, Jul 28, 2014 at 11:11:45AM +0200, Lukáš Czerner wrote:
> > 
> > I very much agree with that, that's why I was quite surprised that I
> > found out recently that this is the default. I was living in the
> > delusion that the default was ERRORS_RO for as long as I can remember.
> > So my question is, should we change it ? This really does not seem
> > like a sane default.
> 
> Yeah, I've been thinking that this would be a good thing to change for
> 1.43.
> 
> The only reason that errors=continue was the default was for
> historical reasons.  I could imagine some system administrators being
> surprised when all of a sudden their production systems start getting
> lots of EROFS errors getting reported by applications.  So I could
> potentially imagine some Help Desks / Support folks at distributions
> not being enthusiastic about such a change.
> 
> Hmm.... we are starting to have some errors where we can allow the
> system to stagger on, even if we need to disallow new allocations in
> some block groups.  I wonder if it is worthwhile to have a "continue
> for correctable errors".  The danger, of course, is that some errors,
> even if they are correctable, (such as freeing a block which is
> already freed), could be a hint that there are other fs corruptions,
> not yet detected, that might lead to data loss if we reboot and fsck,
> or remount readonly right away.  So the question is while there is
> some value, is it worth the added complexity to add an
> "errors=continue-correctable" option?

Right,

I like the idea of a new errors option, though the name is a bit long
(maybe "auto" instead). It would try its best to continue, but would be
allowed to remount read-only if we cannot recover from the error.
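
As a rough illustration of the semantics being discussed (a toy model in
Python, not kernel code; the "auto" policy name, the `correctable` flag and
every identifier below are assumptions made for this sketch only):

```python
from enum import Enum


class Policy(Enum):
    # The three existing ext4 errors= behaviours.
    CONTINUE = "continue"
    REMOUNT_RO = "remount-ro"
    PANIC = "panic"
    # Hypothetical option from this thread; it does not exist in ext4.
    AUTO = "auto"


class Action(Enum):
    KEEP_GOING = "keep going"
    REMOUNT_READ_ONLY = "remount read-only"
    PANIC = "panic"


def on_error(policy, correctable):
    """Return the action taken when a filesystem error is detected.

    `correctable` is an assumed classification of the error; deciding
    that reliably is the hard part and is not modelled here.
    """
    if policy is Policy.PANIC:
        return Action.PANIC
    if policy is Policy.REMOUNT_RO:
        return Action.REMOUNT_READ_ONLY
    if policy is Policy.CONTINUE:
        return Action.KEEP_GOING
    # AUTO: continue only when the error is believed to be recoverable.
    if correctable:
        return Action.KEEP_GOING
    return Action.REMOUNT_READ_ONLY
```

In this model the three existing policies ignore the `correctable` flag
entirely, and only the hypothetical "auto" falls back to read-only when
recovery is not possible.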

However, this would need some work to make it reliable and, most
importantly, a fair amount of testing. Still, I think it's worth the
effort.

-Lukas

> 
> 							- Ted