Re: [syzbot] [ext4?] KASAN: slab-out-of-bounds Read in ext4_group_desc_csum

"Theodore Ts'o" <tytso@xxxxxxx> · Sat, 29 Apr 2023 22:55:12 -0400

On Mon, Mar 13, 2023 at 12:57:28PM +0100, Jan Kara wrote:
> > 
> > I can now explain how the contents of the super block of the buffer get
> > corrupted. After the ext4 fs is mounted to the target ("./bus"), the
> > reproducer maps 6MB of data starting at offset 0 in the target's file
> > ("./bus"), then it starts overriding the data with something else, by
> > using memcpy, memset, individual byte inits. Does that mean that we
> > shouldn't rely on the contents of the super block in the buffer after we
> > mount the file system?

It's not reasonable to avoid relying on the contents of the superblock
under all cases.  HOWEVER, sometimes it might make sense.  See below...

> So the result is that the reproducer modified the block device while it is
> mounted by the filesystem. We know cases like this can crash the kernel and
> it is inherently difficult to fix. We have to trust the buffer cache
> contents as otherwise the performance will be unacceptable. For historical
> reasons we also have to allow modifications of buffer cache while ext4 is
> mounted because tune2fs uses this to e.g. update the label of a mounted
> filesystem.

I've been taking a look at some of the syzkaller reports for ext4, and
there are a number of sysbot reports which are caused by the
reproducer messing with the block device while the file system is
mounted, including:

KASAN: slab-out-of-bounds Read in get_max_inline_xattr_value_size
    https://syzkaller.appspot.com/bug?id=731e35eeed762019e385baa96953d9ec8eb63c10
    syzbot+1966db24521e5f6e23f7@xxxxxxxxxxxxxxxxxxxxxxxxx

KASAN: slab-use-after-free Read in ext4_convert_inline_data_nolock
    https://syzkaller.appspot.com/bug?id=434a92f091e845da1ba387fb93f186412e30e35c
    syzbot+db6caad9ebd2c8022b41@xxxxxxxxxxxxxxxxxxxxxxxxx

kernel BUG in ext4_get_group_info
    https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220
    syzbot+e2efa3efc15a1c9e95c3@xxxxxxxxxxxxxxxxxxxxxxxxx

(The easiest way to find them is to look at the Syzkaller reproducer,
and look for bind mounts of /dev/loopN to "./bus".  It's much less
painful than trying to find it in the C reproducer text file.)

As Jan has pointed out, we can't disable writing to the block device,
because this would break real-world system administrator workloads,
including the ability to set the label and uuid, use tune2fs to set
various parameters on the file system, etc.  We do have ioctls that
allow for setting the label and uuid, and in maybe ten years we should
be able to get to the point where all of the enterprise kernels still
supported by Red Hat, SuSE, etc. can be guaranteed to support all of
the necessary ioctls --- some of which still need to be implemented.

So this will take a *while*, and especially while senior management
types at many companies are announcing layoffs, cutting travel, and
talking about "year of efficiency" and "sharpening focus"[1], I don't
think we'll have much luck getting funded head count to impement
missing ioctls, other than slowly, on volunteer time, and maybe as
intern projects.  So what should we do in the intervening
year(s)/decade?  I'd propose the following priorities.

[1]  while simultaneously whining about "kernel (security) disasters"
and blaming the upstream developers.  Sigh...

>From a quality of implementation (QoI) perspective, once we've
determined that it's caused by "messing with the block device while it
is mounted", if it just causes a denial of service attack, these should
be the lowest priority.  However, if there is an easy way to fix it,
AND if it fixes other issues OR makes the kernel smaller and/or more
efficient, I won't turn away those kind of proposed patches.

For example, in the case of the syzkaller report discussed in this
thread ("KASAN: slab-out-of-bounds Read in ext4_group_desc_csum"),
Tudor's proposed change of replacing

	le16_to_cpu(sbi->s_es->s_desc_size)

with
	sbi->s_desc_size

will actually reduce ext4's compiled text size, and make the code more
efficient (we remove an extra indirect reference and a potential byte
swap on big endian systems), and there is no downside.  In fact, in
many places we use sbi->s_desc_size in preference to accessing the
s_es variable; that's why we put it in the ext4_super_info structure
in the first place!  So sure, we should make this change, and if it
avoids a potential KASAN / syzkaller failure, that's a bonus.

Slightly higher in priority are those bugs which might allow kernel
state to be leaked ("kernel confidentiality").  Of course, if the
process with root access can write to the block device, it can almost
certainly read that block device as well; but there might be critical
bits of kernel state (for example, an RSA private key), in kernel
memory, that if leaked, it would be sad.

The highest priority would go to those where root access might be
leveraged to allow arbitrary code to be executed in kernel mode
("kernel integrity") --- which is unfortunate because it allows root
access to breach lockdown security.

Of course, since many of the people working syzbot reports for ext4
are volunteers and/or company engineers working on their own unfunded
personal time, we still can't *guarantee* anything.  In addition, I'd
still reject a patch which had an overly expensive CPU or memory
overhead with a "try harder".  So it would still be on a case-by-case
basis whether such patches would be accepted.  After all, some
business leaders have elected to disable some mitigations for
Spectre/Meltdown and related attacks because they were Too Damn
Expensive.  I reserve the right as upstream maintainer to make similar
judgement calls.

						- Ted

P.S.  As another example, over the weekend, I've been working on some
patches in the works to address the third syzbot report listed above
("kernel BUG in ext4_get_group_info").  When I evaluated these
patches, I found that they increased the compiled text size by 2k when
I added the additional checks, none of which were in hot paths.  But
after I un-inlined ext4_get_group_info(), the compiled test size
shrunk by 4k, for a net 2k byte *savings* in compiled kernel text
memory.

We already had similar checks and calls to ext4_error() in
ext4_get_group_desc(); this patch was just added a similar conditional
call to ext4_error() to ext4_get_group_info() --- and changing the
callers of that function to check for a NULL return from that
function.  While this change only prevents a denial of service attack,
in my judgement the QoI benefits outweigh the costs.