Re: [PATCH 1/2] xfs: catch inode allocation state mismatch corruption

"Darrick J. Wong" <darrick.wong@xxxxxxxxxx> · Wed, 21 Mar 2018 10:06:40 -0700

On Wed, Mar 21, 2018 at 06:52:47PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
> 
> We recently came across a V4 filesystem causing memory corruption
> due to a newly allocated inode being setup twice and being added to
> the superblock inode list twice. From code inspection, the only way
> this could happen is if a newly allocated inode was not marked as
> free on disk (i.e. di_mode wasn't zero).
> 
> Running the metadump on an upstream debug kernel fails during inode
> allocation like so:
> 
> XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inode.c, line: 838
> ------------[ cut here ]------------
> kernel BUG at fs/xfs/xfs_message.c:114!
> invalid opcode: 0000 [#1] PREEMPT SMP
> CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> RIP: 0010:assfail+0x28/0x30
> RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202
> RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000
> RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b
> RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30
> R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10
> FS:  00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0
> Call Trace:
>  xfs_ialloc+0x383/0x570
>  xfs_dir_ialloc+0x6a/0x2a0
>  xfs_create+0x412/0x670
>  xfs_generic_create+0x1f7/0x2c0
>  ? capable_wrt_inode_uidgid+0x3f/0x50
>  vfs_mkdir+0xfb/0x1b0
>  SyS_mkdir+0xcf/0xf0
>  do_syscall_64+0x73/0x1a0
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> Extracting the inode number we crashed on from an event trace and
> looking at it with xfs_db:
> 
> xfs_db> inode 184452204
> xfs_db> p
> core.magic = 0x494e
> core.mode = 0100644
> core.version = 2
> core.format = 2 (extents)
> core.nlinkv2 = 1
> core.onlink = 0
> .....
> 
> Confirms that it is not a free inode on disk. xfs_repair
> also trips over this inode:
> 
> .....
> zero length extent (off = 0, fsbno = 0) in ino 184452204
> correcting nextents for inode 184452204
> bad attribute fork in inode 184452204, would clear attr fork
> bad nblocks 1 for inode 184452204, would reset to 0
> bad anextents 1 for inode 184452204, would reset to 0
> imap claims in-use inode 184452204 is free, would correct imap
> would have cleared inode 184452204
> .....
> disconnected inode 184452204, would move to lost+found
> 
> And so we have a situation where the directory structure and the
> inobt thinks the inode is free, but the inode on disk thinks it is
> still in use. Where this corruption came from is not possible to
> diagnose, but we can detect it and prevent the kernel from oopsing
> on lookup. The reproducer now results in:
> 
> $ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5}
> mkdir: cannot create directory ‘/mnt/scratch/00’: File exists
> mkdir: cannot create directory ‘/mnt/scratch/01’: File exists
> mkdir: cannot create directory ‘/mnt/scratch/03’: Structure needs cleaning
> mkdir: cannot create directory ‘/mnt/scratch/04’: Input/output error
> mkdir: cannot create directory ‘/mnt/scratch/05’: Input/output error
> ....
> 
> And this corruption shutdown:
> 
> [   54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not marked free on disk
> [   54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 of file fs/xfs/xfs_trans.c.  Caller xfs_create+0x425/0x670
> [   54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #443
> [   54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [   54.852859] Call Trace:
> [   54.853531]  dump_stack+0x85/0xc5
> [   54.854385]  xfs_trans_cancel+0x197/0x1c0
> [   54.855421]  xfs_create+0x425/0x670
> [   54.856314]  xfs_generic_create+0x1f7/0x2c0
> [   54.857390]  ? capable_wrt_inode_uidgid+0x3f/0x50
> [   54.858586]  vfs_mkdir+0xfb/0x1b0
> [   54.859458]  SyS_mkdir+0xcf/0xf0
> [   54.860254]  do_syscall_64+0x73/0x1a0
> [   54.861193]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [   54.862492] RIP: 0033:0x7fb73bddf547
> [   54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
> [   54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73bddf547
> [   54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffdaa55449a
> [   54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a8670dd0
> [   54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 00000000000001ff
> [   54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffdaa553500
> [   54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1024 of file fs/xfs/xfs_trans.c.  Return address = ffffffff814cd050
> [   54.882790] XFS (loop0): Corruption of in-memory data detected.  Shutting down filesystem
> [   54.884597] XFS (loop0): Please umount the filesystem and rectify the problem(s)
> 
> Note that this crash is only possible on v4 filesystemsi or v5
> filesystems mounted with the ikeep mount option. For all other V5
> filesystems, this problem cannot occur because we don't read inodes
> we are allocating from disk - we simply overwrite them with the new
> inode information.

Got a test case for this scenario? :)

The patch looks ok, but I'd rather have some sort of reproducer so I can
verify that it works (and that scrub will notice the broken state) even
if we have to mkfs + db to check it.

> Signed-Off-By: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_icache.c | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 1dc37b72b6ea..98b7a4ae15e4 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -484,7 +484,28 @@ xfs_iget_cache_miss(
>  
>  	trace_xfs_iget_miss(ip);
>  
> -	if ((VFS_I(ip)->i_mode == 0) && !(flags & XFS_IGET_CREATE)) {
> +
> +	/*
> +	 * If we are allocating a new inode, then check what was returned is
> +	 * actually a free, empty inode. If we are not allocating an inode,
> +	 * the check we didn't find a free inode.
> +	 */
> +	if (flags & XFS_IGET_CREATE) {
> +		if (VFS_I(ip)->i_mode != 0) {
> +			xfs_warn(mp,
> +"Corruption detected! Free inode 0x%llx not marked free on disk",
> +				ino);
> +			error = -EFSCORRUPTED;
> +			goto out_destroy;
> +		}
> +		if (ip->i_d.di_nblocks != 0) {
> +			xfs_warn(mp,
> +"Corruption detected! Free inode 0x%llx has blocks allocated!",
> +				ino);
> +			error = -EFSCORRUPTED;
> +			goto out_destroy;

I've a patch out for review that adds a xfs_inode_verifier_error
function that spits out a standardized corruption warning, a hex dump of
the bad dinode, and tells the user to run repair.  This seems like a
good candidate for that.

--D

> +		}
> +	} else if (VFS_I(ip)->i_mode == 0) {
>  		error = -ENOENT;
>  		goto out_destroy;
>  	}
> -- 
> 2.16.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html