On Wed, Mar 21, 2018 at 06:52:47PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> We recently came across a V4 filesystem causing memory corruption
> due to a newly allocated inode being set up twice and being added to
> the superblock inode list twice. From code inspection, the only way
> this could happen is if a newly allocated inode was not marked as
> free on disk (i.e. di_mode wasn't zero).
>
> Running the metadump on an upstream debug kernel fails during inode
> allocation like so:
>
> XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inode.c, line: 838
> ------------[ cut here ]------------
> kernel BUG at fs/xfs/xfs_message.c:114!
> invalid opcode: 0000 [#1] PREEMPT SMP
> CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> RIP: 0010:assfail+0x28/0x30
> RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202
> RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000
> RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b
> RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30
> R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10
> FS: 00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0
> Call Trace:
>  xfs_ialloc+0x383/0x570
>  xfs_dir_ialloc+0x6a/0x2a0
>  xfs_create+0x412/0x670
>  xfs_generic_create+0x1f7/0x2c0
>  ? capable_wrt_inode_uidgid+0x3f/0x50
>  vfs_mkdir+0xfb/0x1b0
>  SyS_mkdir+0xcf/0xf0
>  do_syscall_64+0x73/0x1a0
>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>
> Extracting the inode number we crashed on from an event trace and
> looking at it with xfs_db:
>
> xfs_db> inode 184452204
> xfs_db> p
> core.magic = 0x494e
> core.mode = 0100644
> core.version = 2
> core.format = 2 (extents)
> core.nlinkv2 = 1
> core.onlink = 0
> .....
>
> Confirms that it is not a free inode on disk. xfs_repair
> also trips over this inode:
>
> .....
> zero length extent (off = 0, fsbno = 0) in ino 184452204
> correcting nextents for inode 184452204
> bad attribute fork in inode 184452204, would clear attr fork
> bad nblocks 1 for inode 184452204, would reset to 0
> bad anextents 1 for inode 184452204, would reset to 0
> imap claims in-use inode 184452204 is free, would correct imap
> would have cleared inode 184452204
> .....
> disconnected inode 184452204, would move to lost+found
>
> And so we have a situation where the directory structure and the
> inobt think the inode is free, but the inode on disk thinks it is
> still in use. Where this corruption came from is not possible to
> diagnose, but we can detect it and prevent the kernel from oopsing
> on lookup. The reproducer now results in:
>
> $ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5}
> mkdir: cannot create directory ‘/mnt/scratch/00’: File exists
> mkdir: cannot create directory ‘/mnt/scratch/01’: File exists
> mkdir: cannot create directory ‘/mnt/scratch/03’: Structure needs cleaning
> mkdir: cannot create directory ‘/mnt/scratch/04’: Input/output error
> mkdir: cannot create directory ‘/mnt/scratch/05’: Input/output error
> ....
>
> And this corruption shutdown:
>
> [ 54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not marked free on disk
> [ 54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 of file fs/xfs/xfs_trans.c. Caller xfs_create+0x425/0x670
> [ 54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #443
> [ 54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> [ 54.852859] Call Trace:
> [ 54.853531]  dump_stack+0x85/0xc5
> [ 54.854385]  xfs_trans_cancel+0x197/0x1c0
> [ 54.855421]  xfs_create+0x425/0x670
> [ 54.856314]  xfs_generic_create+0x1f7/0x2c0
> [ 54.857390]  ? capable_wrt_inode_uidgid+0x3f/0x50
> [ 54.858586]  vfs_mkdir+0xfb/0x1b0
> [ 54.859458]  SyS_mkdir+0xcf/0xf0
> [ 54.860254]  do_syscall_64+0x73/0x1a0
> [ 54.861193]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> [ 54.862492] RIP: 0033:0x7fb73bddf547
> [ 54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
> [ 54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73bddf547
> [ 54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffdaa55449a
> [ 54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a8670dd0
> [ 54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 00000000000001ff
> [ 54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffdaa553500
> [ 54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1024 of file fs/xfs/xfs_trans.c. Return address = ffffffff814cd050
> [ 54.882790] XFS (loop0): Corruption of in-memory data detected. Shutting down filesystem
> [ 54.884597] XFS (loop0): Please umount the filesystem and rectify the problem(s)
>
> Note that this crash is only possible on v4 filesystems or v5
> filesystems mounted with the ikeep mount option. For all other V5
> filesystems, this problem cannot occur because we don't read inodes
> we are allocating from disk - we simply overwrite them with the new
> inode information.

Got a test case for this scenario? :)

The patch looks ok, but I'd rather have some sort of reproducer so I can
verify that it works (and that scrub will notice the broken state) even
if we have to mkfs + db to check it.

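For reference while reading the hunk, the decision the patch makes in
xfs_iget_cache_miss() can be modelled in userspace roughly as follows.
This is only a sketch under stated assumptions: struct model_inode,
model_iget_check() and MODEL_IGET_CREATE are made-up names standing in
for the kernel inode, the new check and XFS_IGET_CREATE, and the
xfs_warn() calls are left out.

```c
#include <errno.h>
#include <stdint.h>

/* XFS maps EFSCORRUPTED onto EUCLEAN ("Structure needs cleaning"). */
#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value; fallback for other platforms */
#endif
#define EFSCORRUPTED	EUCLEAN

/* Hypothetical stand-in for XFS_IGET_CREATE; the value is arbitrary. */
#define MODEL_IGET_CREATE	0x1u

/* Hypothetical model of the two on-disk fields the patch looks at. */
struct model_inode {
	uint16_t	mode;		/* models VFS_I(ip)->i_mode */
	uint64_t	nblocks;	/* models ip->i_d.di_nblocks */
};

/*
 * Returns 0 if the inode read from disk is acceptable for this lookup,
 * otherwise a negative errno, mirroring the branches in the patch.
 */
static int
model_iget_check(const struct model_inode *ip, unsigned int flags)
{
	if (flags & MODEL_IGET_CREATE) {
		/* Allocating: the inode must be free and empty on disk. */
		if (ip->mode != 0)
			return -EFSCORRUPTED;
		if (ip->nblocks != 0)
			return -EFSCORRUPTED;
		return 0;
	}

	/* Plain lookup: finding a free inode means it does not exist. */
	if (ip->mode == 0)
		return -ENOENT;
	return 0;
}
```

With the inode from the report (mode 0100644, nblocks 1) and the create
flag set, this model returns -EFSCORRUPTED, which lines up with the
"Structure needs cleaning" error mkdir reports above.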
> Signed-Off-By: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_icache.c | 23 ++++++++++++++++++++++-
>  1 file changed, 22 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 1dc37b72b6ea..98b7a4ae15e4 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -484,7 +484,28 @@ xfs_iget_cache_miss(
>
>  	trace_xfs_iget_miss(ip);
>
> -	if ((VFS_I(ip)->i_mode == 0) && !(flags & XFS_IGET_CREATE)) {
> +
> +	/*
> +	 * If we are allocating a new inode, then check what was returned is
> +	 * actually a free, empty inode. If we are not allocating an inode,
> +	 * then check we didn't find a free inode.
> +	 */
> +	if (flags & XFS_IGET_CREATE) {
> +		if (VFS_I(ip)->i_mode != 0) {
> +			xfs_warn(mp,
> +"Corruption detected! Free inode 0x%llx not marked free on disk",
> +				ino);
> +			error = -EFSCORRUPTED;
> +			goto out_destroy;
> +		}
> +		if (ip->i_d.di_nblocks != 0) {
> +			xfs_warn(mp,
> +"Corruption detected! Free inode 0x%llx has blocks allocated!",
> +				ino);
> +			error = -EFSCORRUPTED;
> +			goto out_destroy;

I've a patch out for review that adds an xfs_inode_verifier_error
function that spits out a standardized corruption warning, a hex dump of
the bad dinode, and tells the user to run repair. This seems like a
good candidate for that.

--D

> +		}
> +	} else if (VFS_I(ip)->i_mode == 0) {
>  		error = -ENOENT;
>  		goto out_destroy;
>  	}
> --
> 2.16.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html