On 11/7/16 8:57 AM, Brian Foster wrote: > On Fri, Nov 04, 2016 at 09:24:16PM -0500, Eric Sandeen wrote: >> There have been several reports over the years of NULL pointer >> dereferences in xfs_trans_log_inode during xfs_fsr processes, >> when the process is doing an fput and tearing down extents >> on the temporary inode, something like: >> >> BUG: unable to handle kernel NULL pointer dereference at 0000000000000018 >> PID: 29439 TASK: ffff880550584fa0 CPU: 6 COMMAND: "xfs_fsr" >> [exception RIP: xfs_trans_log_inode+0x10] >> #9 [ffff8800a57bbbe0] xfs_bunmapi at ffffffffa037398e [xfs] >> #10 [ffff8800a57bbce8] xfs_itruncate_extents at ffffffffa0391b29 [xfs] >> #11 [ffff8800a57bbd88] xfs_inactive_truncate at ffffffffa0391d0c [xfs] >> #12 [ffff8800a57bbdb8] xfs_inactive at ffffffffa0392508 [xfs] >> #13 [ffff8800a57bbdd8] xfs_fs_evict_inode at ffffffffa035907e [xfs] >> #14 [ffff8800a57bbe00] evict at ffffffff811e1b67 >> #15 [ffff8800a57bbe28] iput at ffffffff811e23a5 >> #16 [ffff8800a57bbe58] dentry_kill at ffffffff811dcfc8 >> #17 [ffff8800a57bbe88] dput at ffffffff811dd06c >> #18 [ffff8800a57bbea8] __fput at ffffffff811c823b >> #19 [ffff8800a57bbef0] ____fput at ffffffff811c846e >> #20 [ffff8800a57bbf00] task_work_run at ffffffff81093b27 >> #21 [ffff8800a57bbf30] do_notify_resume at ffffffff81013b0c >> #22 [ffff8800a57bbf50] int_signal at ffffffff8161405d >> >> As it turns out, this is because the i_itemp pointer, along >> with the d_ops pointer, has been overwritten with zeros >> when we tear down the extents during truncate. When the in-core >> inode fork on the temporary inode used by xfs_fsr was originally >> set up during the extent swap, we mistakenly looked at di_nextents >> to determine whether all extents fit inline, but this misses extents >> generated by speculative preallocation; we should be using if_bytes >> instead. >> > > Neat bug. :P The fix looks fine, but technically di_nextents doesn't > include any delalloc extents, right? From taking a quick skim through > the test case, it looks like the problematic situation is when the file > flush doesn't end up converting post-eof delalloc (created via > speculative prealloc) due to free space fragmentation. > > So in other words, in most cases a file flush will convert all delalloc, > including post-eof preallocation, by virtue of being part of an extent > that is partly within file eof. In the fragmented free space situation, > however, the file flush is not necessarily guaranteed to convert all > delalloc blocks past eof, thus di_nextents is not consistent with the > in-core inode state, and badness ensues... To be honest I hadn't thought it through that far. Could modify the testcase to free up more contiguous space prior to the last flush, perhaps, and see what happens. -Eric > If I'm following that correctly it would be nice to just clarify that a > bit in the commit log. Otherwise looks good to me: > > Reviewed-by: Brian Foster <bfoster@xxxxxxxxxx> > >> This mistake corrupts the in-memory inode, and code in >> xfs_iext_remove_inline eventually gets bad inputs, causing >> it to memmove and memset incorrect ranges; this became apparent >> because the two values in ifp->if_u2.if_inline_ext[1] contained >> what should have been in d_ops and i_itemp; they were memmoved due >> to incorrect array indexing and then the original locations >> were zeroed with memset, again due to an array overrun. >> >> Fix this by properly using i_df.if_bytes to determine the number >> of extents, not di_nextents. >> >> Thanks to dchinner for looking at this with me and spotting the >> root cause. >> >> Cc: stable@xxxxxxxxxxxxxxx >> Signed-off-by: Eric Sandeen <sandeen@xxxxxxxxxx> >> --- >> >> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c >> index 552465e..47074e0 100644 >> --- a/fs/xfs/xfs_bmap_util.c >> +++ b/fs/xfs/xfs_bmap_util.c >> @@ -1792,6 +1792,7 @@ >> struct xfs_ifork tempifp, *ifp, *tifp; >> int aforkblks = 0; >> int taforkblks = 0; >> + xfs_extnum_t nextents; >> __uint64_t tmp; >> int error; >> >> @@ -1881,7 +1882,8 @@ >> * pointer. Otherwise it's already NULL or >> * pointing to the extent. >> */ >> - if (ip->i_d.di_nextents <= XFS_INLINE_EXTS) { >> + nextents = ip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t); >> + if (nextents <= XFS_INLINE_EXTS) { >> ifp->if_u1.if_extents = >> ifp->if_u2.if_inline_ext; >> } >> @@ -1900,7 +1902,8 @@ >> * pointer. Otherwise it's already NULL or >> * pointing to the extent. >> */ >> - if (tip->i_d.di_nextents <= XFS_INLINE_EXTS) { >> + nextents = tip->i_df.if_bytes / (uint)sizeof(xfs_bmbt_rec_t); >> + if (nextents <= XFS_INLINE_EXTS) { >> tifp->if_u1.if_extents = >> tifp->if_u2.if_inline_ext; >> } >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in >> the body of a message to majordomo@xxxxxxxxxxxxxxx >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@xxxxxxxxxxxxxxx > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-xfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html