Re: Question: reserve log space at IO time for recover

Wengang Wang <wen.gang.wang@xxxxxxxxxx> · Wed, 19 Jul 2023 01:46:31 +0000

> On Jul 18, 2023, at 5:11 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> 
> On Tue, Jul 18, 2023 at 10:57:38PM +0000, Wengang Wang wrote:
>> Hi,
>> 
>> I have an XFS metadump (the system was running 4.14.35 plus some backported patches);
>> mounting it (log recovery) hangs at log space reservation. There are 181760 bytes of on-disk
>> free journal space, while the transaction needs to reserve 360416 bytes to start the recovery.
>> Thus the mount hangs forever.
> 
> Most likely something went wrong at runtime on the 4.14.35 kernel
> prior to the crash, leaving the on-disk state in an impossible to
> recover state. Likely an accounting leak in a transaction
> reservation somewhere, likely in passing the space used from the
> transaction to the CIL. We've had bugs in this area before, they
> eventually manifest in log hangs like this either at runtime or
> during recovery...
> 
>> That happens with the 4.14.35 kernel and also the upstream
>> kernel (6.4.0).
> 
> Upgrading the kernel won't fix recovery - it is likely that the
> journal state on disk is invalid and so the mount cannot complete.
> 
>> This is the related stack dump (6.4.0 kernel):
>> 
>> [<0>] xlog_grant_head_wait+0xbd/0x200 [xfs]
>> [<0>] xlog_grant_head_check+0xd9/0x100 [xfs]
>> [<0>] xfs_log_reserve+0xbc/0x1e0 [xfs]
>> [<0>] xfs_trans_reserve+0x138/0x170 [xfs]
>> [<0>] xfs_trans_alloc+0xe8/0x220 [xfs]
>> [<0>] xfs_efi_item_recover+0x110/0x250 [xfs]
>> [<0>] xlog_recover_process_intents.isra.28+0xba/0x2d0 [xfs]
>> [<0>] xlog_recover_finish+0x33/0x310 [xfs]
>> [<0>] xfs_log_mount_finish+0xdb/0x160 [xfs]
>> [<0>] xfs_mountfs+0x51c/0x900 [xfs]
>> [<0>] xfs_fs_fill_super+0x4b8/0x940 [xfs]
>> [<0>] get_tree_bdev+0x193/0x280
>> [<0>] vfs_get_tree+0x26/0xd0
>> [<0>] path_mount+0x69d/0x9b0
>> [<0>] do_mount+0x7d/0xa0
>> [<0>] __x64_sys_mount+0xdc/0x100
>> [<0>] do_syscall_64+0x3b/0x90
>> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
>> 
>> Thus we can say the 4.14.35 kernel didn’t reserve log space at IO time to make log recovery
>> safe. The upstream kernel doesn’t do that either, if I read the source code right (I might be wrong).
> 
> Sure they do.
> 
> Log space usage is what the grant heads track; transactions are not
> allowed to start if there isn't both reserve and write grant head
> space available for them, and transaction rolls get held until there
> is write grant space available for them (i.e. they can block in
> xfs_trans_roll() -> xfs_trans_reserve() waiting for write grant head
> space).
> 
> There have been bugs in the grant head accounting mechanisms in the
> past, there may well still be bugs in it. But it is the grant head
> mechanism that is supposed to guarantee there is always space in
> the journal for a transaction to commit, and by extension, ensure
> that we always have space in the journal for a transaction to be
> fully recovered.
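
The rule described above (a transaction may only start when both the reserve and the write grant head have room) can be sketched as a toy model. This is only an illustration under stated assumptions, not the XFS implementation: the class `ToyLog`, its fields, and the 1,000,000-byte journal size are hypothetical, and where the kernel would sleep waiting for space the sketch simply refuses. Only the two byte counts (181760 free, 360416 requested) come from the report.

```python
from dataclasses import dataclass


@dataclass
class ToyLog:
    """Toy journal space accounting; hypothetical, not the XFS code."""
    size: int               # total journal size in bytes (made-up value)
    reserve_used: int = 0   # bytes accounted against the reserve grant head
    write_used: int = 0     # bytes accounted against the write grant head

    def free_space(self) -> int:
        # Space available to a new transaction is limited by the fuller head.
        return self.size - max(self.reserve_used, self.write_used)

    def reserve(self, nbytes: int) -> bool:
        # A new transaction needs room under both grant heads.
        if self.free_space() < nbytes:
            return False    # the kernel would block here; the sketch refuses
        self.reserve_used += nbytes
        self.write_used += nbytes
        return True


# Hypothetical journal sized so that exactly 181760 bytes remain free.
JOURNAL_SIZE = 1_000_000
used = JOURNAL_SIZE - 181760
log = ToyLog(size=JOURNAL_SIZE, reserve_used=used, write_used=used)

needed = 360416                          # reservation requested by recovery
granted = log.reserve(needed)            # cannot be satisfied
shortfall = needed - log.free_space()    # bytes missing for the reservation
```

With the numbers from the report the reservation is never granted, which in this model corresponds to the mount waiting indefinitely in xlog_grant_head_wait().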
> 
>> So shall we reserve a proper amount of log space at IO time, call it Unflush-Reserve (UR), to
>> ensure log recovery is safe? The size of the UR is determined by the current unflushed log items.
>> It is increased just after a transaction is committed and decreased when log items are
>> flushed. With the UR, we are sure to have enough log space for the transactions used by log
>> recovery.
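
The proposed bookkeeping can be sketched as a simple counter. This only illustrates the idea as stated in the question, not an existing kernel mechanism: the class `UnflushReserve`, its method names, the item identifiers, and the assumption that each log item carries a known recovery cost in bytes are all hypothetical.

```python
class UnflushReserve:
    """Hypothetical tracker for log space held back for recovery."""

    def __init__(self) -> None:
        self._per_item: dict[str, int] = {}  # item id -> reserved bytes
        self.reserved = 0                    # current total UR in bytes

    def on_commit(self, item_id: str, recovery_bytes: int) -> None:
        # Re-committing a tracked item replaces its earlier reservation.
        self.reserved -= self._per_item.get(item_id, 0)
        self._per_item[item_id] = recovery_bytes
        self.reserved += recovery_bytes

    def on_flush(self, item_id: str) -> None:
        # Once the item is flushed, recovery no longer needs space for it.
        self.reserved -= self._per_item.pop(item_id, 0)


ur = UnflushReserve()
ur.on_commit("efi-1", 360416)    # size taken from the report above
ur.on_commit("inode-7", 4096)    # made-up second item
after_commit = ur.reserved
ur.on_flush("efi-1")
after_flush = ur.reserved
```

The counter grows when transactions commit and shrinks as their items are flushed, which is the behaviour the question describes.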
> 
> The grant heads already track log space usage and reservations like
> this. If you want to learn more about the nitty gritty details, look
> at this patch set that is aimed at changing how the grant heads
> track the used/reserved log space to improve performance:
> 
> https://lore.kernel.org/linux-xfs/20221220232308.3482960-1-david@xxxxxxxxxxxxx/

Thanks a lot, Dave!
I will look more into the write head and the above patch set.

Have a good day,
Wengang