Re: kernel BUG at fs/ext4/mballoc.c:1911!

"Theodore Y. Ts'o" <tytso@xxxxxxx> · Mon, 9 Apr 2018 19:44:41 -0400

On Mon, Apr 09, 2018 at 10:25:46AM -0700, Liu Bo wrote:
> > (e) there're errors about reading this bitmap(group 8383) shown in the log,
> > crash> grep group e4b.txt
> >   bd_group = 8383
> >
> >  however when it comes to BUG_ON(k >= max), reading this bitmap has
> >  been successful, and it is the inconsistence between ->bb_counters
> >  and the buddy bitmap that ends up with the crash, but if the buddy
> >  bitmap was regenerated, bb_counters should match with the buddy
> >  bitmap.

What probably happened is that the page containing the actual allocation
bitmap was pushed out of memory due to memory pressure.  However, the
buddy bitmap was still cached in memory.  That's actually quite
possible since the buddy bitmap will often be referenced more
frequently than the allocation bitmap (for example, while searching
for free space of a specific size, and then having that block group
skipped when it's not available).

Since there was an I/O error reading the allocation bitmap, the buffer
is not valid.  So it's not surprising that the BUG_ON(k >= max) is
getting triggered.
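
To make the failure mode concrete, here is a minimal userspace sketch.
It is a toy model, not the mballoc code: toy_group, toy_group_fill,
toy_clear_bit, find_next_zero_bit_toy and scan_order are all made-up
names, and the return codes stand in for what the kernel does.  It
models a cached free counter that claims free chunks exist while the
bitmap that is supposed to back it has no zero bit, which is the
condition that BUG_ON(k >= max) catches.

```c
#include <assert.h>
#include <string.h>

#define TOY_MAX_BITS      64
#define TOY_NONE_FREE     (-1)	/* counter says nothing is free */
#define TOY_INCONSISTENT  (-2)	/* counter and bitmap disagree */

/* Hypothetical stand-in for per-group state; not the kernel's struct. */
struct toy_group {
	unsigned int free_count;		/* cached, like bb_counters[i] */
	unsigned char bitmap[TOY_MAX_BITS / 8];	/* 1 = in use, 0 = free */
};

/* Set the cached counter and mark every bit as in use. */
static void toy_group_fill(struct toy_group *g, unsigned int free_count)
{
	g->free_count = free_count;
	memset(g->bitmap, 0xff, sizeof(g->bitmap));
}

/* Mark a single bit as free. */
static void toy_clear_bit(struct toy_group *g, int bit)
{
	assert(bit >= 0 && bit < TOY_MAX_BITS);
	g->bitmap[bit / 8] &= (unsigned char)~(1u << (bit % 8));
}

/* Return the index of the first zero bit in [start, max), or max if none. */
static int find_next_zero_bit_toy(const unsigned char *map, int max, int start)
{
	for (int k = start; k < max; k++)
		if (!(map[k / 8] & (1u << (k % 8))))
			return k;
	return max;
}

/*
 * Trust the counter first, then look in the bitmap.  Where the kernel
 * has BUG_ON(k >= max), this toy returns an error code instead.
 */
static int scan_order(const struct toy_group *g, int max)
{
	int k;

	if (g->free_count == 0)
		return TOY_NONE_FREE;
	k = find_next_zero_bit_toy(g->bitmap, max, 0);
	if (k >= max)
		return TOY_INCONSISTENT;
	return k;
}
```

With a counter of 3 and bit 10 cleared, scan_order() returns 10.  With
the same counter and every bit set, which is roughly what a stale or
invalid buffer can look like, it returns TOY_INCONSISTENT.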

That's of course not desirable.  What should happen is that once we
realize that the allocation bitmap can't be read, we should mark the
block group as not being eligible for allocations via the
EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, to avoid the BUG_ON from
triggering.
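
Roughly, the shape of the change would be something like the sketch
below.  This is a userspace toy under stated assumptions, not a
proposed patch: toy_group_info, TOY_BBITMAP_CORRUPT, toy_read_bitmap,
toy_group_usable and toy_pick_group are hypothetical names, and the
read_fails field just simulates an I/O error.  The idea it shows is
only that a failed bitmap read sets a per-group flag, and the
allocator skips any group that has the flag set.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag, modelled on the idea of a "bitmap corrupt" bit. */
#define TOY_BBITMAP_CORRUPT	(1u << 0)

struct toy_group_info {
	unsigned int state;		/* flag bits */
	unsigned int free_blocks;	/* cached free block count */
	bool read_fails;		/* simulates an I/O error on read */
};

/* Pretend to read the allocation bitmap; flag the group on failure. */
static int toy_read_bitmap(struct toy_group_info *grp)
{
	if (grp->read_fails) {
		grp->state |= TOY_BBITMAP_CORRUPT;
		return -EIO;
	}
	return 0;
}

/* A group is eligible only if it is not flagged and has free blocks. */
static bool toy_group_usable(const struct toy_group_info *grp)
{
	return !(grp->state & TOY_BBITMAP_CORRUPT) && grp->free_blocks > 0;
}

/* Return the index of the first usable group, or -1 if there is none. */
static int toy_pick_group(struct toy_group_info *groups, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (toy_read_bitmap(&groups[i]) != 0)
			continue;	/* unreadable bitmap: skip, don't crash */
		if (toy_group_usable(&groups[i]))
			return (int)i;
	}
	return -1;
}
```

In this toy the flag is sticky: once a group has been marked, it stays
ineligible even if a later read succeeds, so the allocator never trusts
counters that may no longer match the on-disk bitmap.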

I'll put it on my TODO list.  Or feel free to try your hand at making
the change yourself, against the latest upstream kernel, and send a
proposed patch to the linux-ext4@xxxxxxxxxxxxxxx mailing list.

Cheers,

						- Ted