On Mon, Mar 13, 2023 at 12:57:28PM +0100, Jan Kara wrote: > > > > I can now explain how the contents of the super block of the buffer get > > corrupted. After the ext4 fs is mounted to the target ("./bus"), the > > reproducer maps 6MB of data starting at offset 0 in the target's file > > ("./bus"), then it starts overriding the data with something else, by > > using memcpy, memset, individual byte inits. Does that mean that we > > shouldn't rely on the contents of the super block in the buffer after we > > mount the file system? It's not reasonable to avoid relying on the contents of the superblock under all cases. HOWEVER, sometimes it might make sense. See below... > So the result is that the reproducer modified the block device while it is > mounted by the filesystem. We know cases like this can crash the kernel and > it is inherently difficult to fix. We have to trust the buffer cache > contents as otherwise the performance will be unacceptable. For historical > reasons we also have to allow modifications of buffer cache while ext4 is > mounted because tune2fs uses this to e.g. update the label of a mounted > filesystem. I've been taking a look at some of the syzkaller reports for ext4, and there are a number of sysbot reports which are caused by the reproducer messing with the block device while the file system is mounted, including: KASAN: slab-out-of-bounds Read in get_max_inline_xattr_value_size https://syzkaller.appspot.com/bug?id=731e35eeed762019e385baa96953d9ec8eb63c10 syzbot+1966db24521e5f6e23f7@xxxxxxxxxxxxxxxxxxxxxxxxx KASAN: slab-use-after-free Read in ext4_convert_inline_data_nolock https://syzkaller.appspot.com/bug?id=434a92f091e845da1ba387fb93f186412e30e35c syzbot+db6caad9ebd2c8022b41@xxxxxxxxxxxxxxxxxxxxxxxxx kernel BUG in ext4_get_group_info https://syzkaller.appspot.com/bug?id=69b28112e098b070f639efb356393af3ffec4220 syzbot+e2efa3efc15a1c9e95c3@xxxxxxxxxxxxxxxxxxxxxxxxx (The easiest way to find them is to look at the Syzkaller reproducer, and look for bind mounts of /dev/loopN to "./bus". It's much less painful than trying to find it in the C reproducer text file.) As Jan has pointed out, we can't disable writing to the block device, because this would break real-world system administrator workloads, including the ability to set the label and uuid, use tune2fs to set various parameters on the file system, etc. We do have ioctls that allow for setting the label and uuid, and in maybe ten years we should be able to get to the point where all of the enterprise kernels still supported by Red Hat, SuSE, etc. can be guaranteed to support all of the necessary ioctls --- some of which still need to be implemented. So this will take a *while*, and especially while senior management types at many companies are announcing layoffs, cutting travel, and talking about "year of efficiency" and "sharpening focus"[1], I don't think we'll have much luck getting funded head count to impement missing ioctls, other than slowly, on volunteer time, and maybe as intern projects. So what should we do in the intervening year(s)/decade? I'd propose the following priorities. [1] while simultaneously whining about "kernel (security) disasters" and blaming the upstream developers. Sigh... >From a quality of implementation (QoI) perspective, once we've determined that it's caused by "messing with the block device while it is mounted", if it just causes a denial of service attack, these should be the lowest priority. However, if there is an easy way to fix it, AND if it fixes other issues OR makes the kernel smaller and/or more efficient, I won't turn away those kind of proposed patches. For example, in the case of the syzkaller report discussed in this thread ("KASAN: slab-out-of-bounds Read in ext4_group_desc_csum"), Tudor's proposed change of replacing le16_to_cpu(sbi->s_es->s_desc_size) with sbi->s_desc_size will actually reduce ext4's compiled text size, and make the code more efficient (we remove an extra indirect reference and a potential byte swap on big endian systems), and there is no downside. In fact, in many places we use sbi->s_desc_size in preference to accessing the s_es variable; that's why we put it in the ext4_super_info structure in the first place! So sure, we should make this change, and if it avoids a potential KASAN / syzkaller failure, that's a bonus. Slightly higher in priority are those bugs which might allow kernel state to be leaked ("kernel confidentiality"). Of course, if the process with root access can write to the block device, it can almost certainly read that block device as well; but there might be critical bits of kernel state (for example, an RSA private key), in kernel memory, that if leaked, it would be sad. The highest priority would go to those where root access might be leveraged to allow arbitrary code to be executed in kernel mode ("kernel integrity") --- which is unfortunate because it allows root access to breach lockdown security. Of course, since many of the people working syzbot reports for ext4 are volunteers and/or company engineers working on their own unfunded personal time, we still can't *guarantee* anything. In addition, I'd still reject a patch which had an overly expensive CPU or memory overhead with a "try harder". So it would still be on a case-by-case basis whether such patches would be accepted. After all, some business leaders have elected to disable some mitigations for Spectre/Meltdown and related attacks because they were Too Damn Expensive. I reserve the right as upstream maintainer to make similar judgement calls. - Ted P.S. As another example, over the weekend, I've been working on some patches in the works to address the third syzbot report listed above ("kernel BUG in ext4_get_group_info"). When I evaluated these patches, I found that they increased the compiled text size by 2k when I added the additional checks, none of which were in hot paths. But after I un-inlined ext4_get_group_info(), the compiled test size shrunk by 4k, for a net 2k byte *savings* in compiled kernel text memory. We already had similar checks and calls to ext4_error() in ext4_get_group_desc(); this patch was just added a similar conditional call to ext4_error() to ext4_get_group_info() --- and changing the callers of that function to check for a NULL return from that function. While this change only prevents a denial of service attack, in my judgement the QoI benefits outweigh the costs.