Re: kernel BUG at fs/ext4/mballoc.c:1911!

Liu Bo <obuil.liubo@xxxxxxxxx> · Mon, 9 Apr 2018 19:27:35 -0700

On Mon, Apr 9, 2018 at 4:44 PM, Theodore Y. Ts'o <tytso@xxxxxxx> wrote:
> On Mon, Apr 09, 2018 at 10:25:46AM -0700, Liu Bo wrote:
>> > (e) there are errors reading this bitmap (group 8383) shown in the log,
>> > crash> grep group e4b.txt
>> >   bd_group = 8383
>> >
>> >  However, by the time BUG_ON(k >= max) is hit, reading this bitmap has
>> >  succeeded, and it is the inconsistency between ->bb_counters and
>> >  the buddy bitmap that leads to the crash. But if the buddy bitmap
>> >  had been regenerated, bb_counters should match the buddy
>> >  bitmap.
>
> What probably happened is that the page containing actual allocation
> bitmap was pushed out of memory due to memory pressure.  However, the
> buddy bitmap was still cached in memory.  That's actually quite
> possible since the buddy bitmap will often be referenced more
> frequently than the allocation bitmap (for example, while searching
> for free space of a specific size, and then having that block group
> skipped when it's not available).
>
> Since there was an I/O error reading the allocation bitmap, the buffer
> is not valid.  So it's not surprising that the BUG_ON(k >= max) is
> getting triggered.
>
> It's of course not desirable.  What should happen is that once we
> realize that the allocation bitmap can't be read, we should mark the
> block group as not being eligible for allocations via the
> EXT4_GROUP_INFO_BBITMAP_CORRUPT_BIT, to prevent the BUG_ON from
> triggering.
>

Sounds good.

> I'll put it on my TODO list.  Or feel free to try your hand at making
> the change yourself, versus the latest upstream kernel, and send a
> proposed patch to the linux-ext4@xxxxxxxxxxxxxxx mailing list.
>

Sure, I'll try to make a patch.

thanks,
liubo