Re: [Bug 215783] New: kernel NULL pointer dereference and general protection fault in fs/xfs/xfs_buf_item_recover.c: xlog_recover_do_reg_buffer() when mount a corrupted image

Dave Chinner <david@xxxxxxxxxxxxx> · Fri, 1 Apr 2022 10:17:23 +1100

On Fri, Apr 01, 2022 at 08:35:39AM +1100, Dave Chinner wrote:
> On Thu, Mar 31, 2022 at 08:07:08PM +0000, bugzilla-daemon@xxxxxxxxxx wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=215783
> > - Overview 
> > kernel NULL pointer dereference and general protection fault in
> > fs/xfs/xfs_buf_item_recover.c:xlog_recover_do_reg_buffer() when mount a
> > corrupted image, sometimes cause kernel hang
> > 
> > - Reproduce 
> > tested on kernel 5.17.1, 5.15.32
> > 
> > $ mkdir mnt
> > $ unzip tmp7.zip
> > $ ./mount.sh xfs 7  ##NULL pointer dereference
> > or
> > $ sudo mount -t xfs tmp7.img mnt ##general protection fault
> > 
> > - Kernel dump
> 
> You've now raised 4 bugs that all look very similar and are quite
> possibly all caused by the same corruption vector.
> Please do some triage on the failure to identify the
> source of the corruption that triggers this failure.

Ok, the log has been intentionally corrupted in a way that does not
happen in the real world, i.e. the iclog header at the tail of the
log has had the CRC zeroed, so CRC checking for media bit corruption
has been intentionally bypassed by the tool that corrupted the log.

The first item is a superblock buffer item, which contains 2
regions; a buf log item and a 384 byte long region containing the
logged superblock data.

However, the buf log item has been screwed with to say that it has 8
regions rather than 2, and so when recovery goes to recover the
third region, which doesn't exist, it falls off the end of the
allocated transaction buffer.
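
To illustrate the mismatch, here is a minimal userspace sketch of the
kind of check that would reject such an item before walking its
regions. Everything in it (struct demo_item, struct demo_region,
demo_item_regions_valid()) is a hypothetical stand-in invented for
this example; it is not the XFS recovery code and not a proposed
patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only, not the XFS structures. */
struct demo_region {
	const void	*addr;	/* start of the region data */
	size_t		len;	/* length of the region in bytes */
};

struct demo_item {
	unsigned int		claimed_regions; /* count read from the untrusted log */
	unsigned int		actual_regions;	 /* regions actually reassembled */
	struct demo_region	regions[8];
};

/* Return true only if every claimed region is backed by real data. */
static bool demo_item_regions_valid(const struct demo_item *item)
{
	/* A claim of 8 regions when only 2 exist is rejected here. */
	if (item->claimed_regions > item->actual_regions)
		return false;

	for (unsigned int i = 0; i < item->claimed_regions; i++) {
		if (item->regions[i].addr == NULL || item->regions[i].len == 0)
			return false;
	}
	return true;
}
```

With an item holding 2 real regions, setting claimed_regions to 8
makes the check fail up front instead of indexing past the data that
was actually recovered.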

We only ever write iclogs with CRCs in them (except for mkfs when it
writes an unmount record to initialise the log), so bit corruptions
like this will get caught before we even start log recovery in
production systems.
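
As a rough sketch of why zeroing the CRC matters, the userspace
example below computes a bitwise CRC32C and applies an assumed
policy: a stored CRC of zero is treated as "no checksum present" and
the record is not verified, while any non-zero stored CRC must match.
The names (demo_crc32c(), demo_check_record(), enum demo_verdict)
and the policy itself are assumptions made for illustration; this is
not how the kernel's log recovery code is written.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t demo_crc32c(const unsigned char *buf, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= buf[i];
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Hypothetical verdicts for this sketch. */
enum demo_verdict { DEMO_OK, DEMO_UNCHECKED, DEMO_CORRUPT };

/*
 * Assumed policy: a zero stored CRC means the record carries no
 * checksum, so its contents are accepted without verification.
 */
static enum demo_verdict demo_check_record(const unsigned char *buf,
					   size_t len, uint32_t stored_crc)
{
	if (stored_crc == 0)
		return DEMO_UNCHECKED;
	return demo_crc32c(buf, len) == stored_crc ? DEMO_OK : DEMO_CORRUPT;
}
```

Under that policy, flipping a bit in a record with a valid stored CRC
yields DEMO_CORRUPT, whereas the same flipped record with the stored
CRC zeroed yields DEMO_UNCHECKED and the bad contents go on to be
parsed.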

We've got enough issues with actual log recovery bugs that we don't
need to be overloaded by being forced to play whack-a-mole with
malicious corruptions that *will not happen in the real world*
because "security!".

Looking at the crash locations for the other bugs, they are all
going to be the same thing - you've corrupted the vector index in
the log item and so they all fall off the end of the buffer because
the index no longer matches the actual contents of the log item.

vvvv THIS vvvv

> If you are going to run some scripted tool to randomly corrupt the
> filesystem to find failures, then you have an ethical and moral
> responsibility to do some of the work to narrow down and identify
> the cause of the failure, not just throw them at someone to do all
> the work.

^^^^ THIS ^^^^^

Please confirm whether your other reports have the same root cause,
and close them if they do. If not, please point us to the unique
corruption in the log that causes each failure.

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx