Re: Filesystem corruption after unreachable storage

"Theodore Y. Ts'o" <tytso@xxxxxxx> · Tue, 25 Feb 2020 12:23:55 -0500

On Tue, Feb 25, 2020 at 02:19:09PM +0100, Jean-Louis Dupond wrote:
> FYI,
> 
> Just did same test with e2fsprogs 1.45.5 (from buster backports) and kernel
> 5.4.13-1~bpo10+1.
> And having exactly the same issue.
> The VM needs a manual fsck after storage outage.
> 
> Don't know if it's useful to test with 5.5 or 5.6?
> But it seems like the issue still exists.

This is going to be a long shot, but if you could try testing with
5.6-rc3, or with this commit cherry-picked into a 5.4 or later kernel:

   commit 8eedabfd66b68a4623beec0789eac54b8c9d0fb6
   Author: wangyan <wangyan122@xxxxxxxxxx>
   Date:   Thu Feb 20 21:46:14 2020 +0800

       jbd2: fix ocfs2 corrupt when clearing block group bits

       I found a NULL pointer dereference in ocfs2_block_group_clear_bits().
       The running environment:
               kernel version: 4.19
               A cluster with two nodes, 5 luns mounted on two nodes, and do some
               file operations like dd/fallocate/truncate/rm on every lun with storage
               network disconnection.

       The fallocate operation on dm-23-45 caused an null pointer dereference.
       ...

... it would be interesting to see if it fixes things for you.  I can't
guarantee that it will, but the trigger of the failure which wangyan
found is very similar indeed.

Thanks,

						- Ted