Re: [PATCH] xfs: log recovery should replay deferred ops in order

Amir Goldstein <amir73il@xxxxxxxxx> · Thu, 30 Nov 2017 22:31:42 +0200



On Wed, Nov 22, 2017 at 8:29 PM, Darrick J. Wong
<darrick.wong@xxxxxxxxxx> wrote:
> From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
>
> As part of testing log recovery with dm_log_writes, Amir Goldstein
> discovered an error in the deferred ops recovery that lead to corruption
> of the filesystem metadata if a reflink+rmap filesystem happened to shut
> down midway through a CoW remap:
>
> "This is what happens [after failed log recovery]:
>
> "Phase 1 - find and verify superblock...
> "Phase 2 - using internal log
> "        - zero log...
> "        - scan filesystem freespace and inode maps...
> "        - found root inode chunk
> "Phase 3 - for each AG...
> "        - scan (but don't clear) agi unlinked lists...
> "        - process known inodes and perform inode discovery...
> "        - agno = 0
> "data fork in regular inode 134 claims CoW block 376
> "correcting nextents for inode 134
> "bad data fork in inode 134
> "would have cleared inode 134"
>
> Hou Tao dissected the log contents of exactly such a crash:
>
> "According to the implementation of xfs_defer_finish(), these ops should
> be completed in the following sequence:
>
> "Have been done:
> "(1) CUI: Oper (160)
> "(2) BUI: Oper (161)
> "(3) CUD: Oper (194), for CUI Oper (160)
> "(4) RUI A: Oper (197), free rmap [0x155, 2, -9]
>
> "Should be done:
> "(5) BUD: for BUI Oper (161)
> "(6) RUI B: add rmap [0x155, 2, 137]
> "(7) RUD: for RUI A
> "(8) RUD: for RUI B
>
> "Actually be done by xlog_recover_process_intents()
> "(5) BUD: for BUI Oper (161)
> "(6) RUI B: add rmap [0x155, 2, 137]
> "(7) RUD: for RUI B
> "(8) RUD: for RUI A
>
> "So the rmap entry [0x155, 2, -9] for COW should be freed firstly,
> then a new rmap entry [0x155, 2, 137] will be added. However, as we can see
> from the log record in post_mount.log (generated after umount) and the trace
> print, the new rmap entry [0x155, 2, 137] are added firstly, then the rmap
> entry [0x155, 2, -9] are freed."
>
> When reconstructing the internal log state from the log items found on
> disk, it's required that deferred ops replay in exactly the same order
> that they would have had the filesystem not gone down.  However,
> replaying unfinished deferred ops can create /more/ deferred ops.  These
> new deferred ops are finished in the wrong order.  This causes fs
> corruption and replay crashes, so let's create a single defer_ops to
> handle the subsequent ops created during replay, then use one single
> transaction at the end of log recovery to ensure that everything is
> replayed in the same order as they're supposed to be.
>
> Reported-by: Amir Goldstein <amir73il@xxxxxxxxx>
> Analyzed-by: Hou Tao <houtao1@xxxxxxxxxx>
> Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> ---

Since it fixes the problem I reported and I couldn't link this patch
with other crashes
I found, you can add:
Tested-by: Amir Goldstein <amir73il@xxxxxxxxx>
--
To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html