Re: Process for severe early stable bugs?

Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> · Sat, 8 Dec 2018 12:56:29 +0100

On Fri, Dec 07, 2018 at 04:33:10PM -0800, Laura Abbott wrote:
> The latest file system corruption issue (Nominally fixed by
> ffe81d45322c ("blk-mq: fix corruption with direct issue") later
> fixed by c616cbee97ae ("blk-mq: punt failed direct issue to dispatch
> list")) brought a lot of rightfully concerned users asking about
> release schedules. 4.18 went EOL on Nov 21 and Fedora rebased to
> 4.19.3 on Nov 23. When the issue started getting visibility,
> users were left with the option of running known EOL 4.18.x
> kernels or running a 4.19 series that could corrupt their
> data. Admittedly, the risk of running the EOL kernel was pretty
> low given how recent it was, but it's still not a great look
> to tell people to run something marked EOL.
> 
> I'm wondering if there's anything we can do to make things easier
> on kernel consumers. Bugs will certainly happen but it really
> makes it hard to push the "always run the latest stable" narrative
> if there isn't a good fallback when things go seriously wrong. I
> don't actually have a great proposal for a solution here other than
> retroactively bringing back 4.18 (which I don't think Greg would
> like) but I figured I should at least bring it up.

A nice step forward would have been if someone could have at least
_told_ the stable maintainer (i.e. me) that there was such a serious bug
out there.  That didn't happen here and I only found out about it
accidentally by happening to talk to a developer who was on the bugzilla
thread at a totally random meeting last Wednesday.

There was also not an email thread that I could find once I found out
about the issue.  By that time the bug was fixed and all I could do was
wait for it to hit Linus's tree (and even then, I had to wait for the
fix to the fix...)  If I had known about it earlier, I would have
reverted the change that caused this.

I would start by looking at how we at least notify people of major
issues like this.  Yes, it was complex and originally blamed on both
btrfs and ext4 changes, and it was dependent on using a brand-new
.config file which no kernel developers use (and it seems no distro uses
either, which protected Fedora and others at the least!)

There will always be bugs and exceptions, and personally I think this
one was rare enough that adding the requirement that I maintain more
than one set of stable trees for longer isn't going to happen (yeah, I
know you said you didn't expect that, but I know others mentioned it to
me...)

So I don't know what to say here other than please tell me about major
issues like this and don't rely on me getting lucky and hearing about it
on my own.

thanks,

greg k-h