> On Aug 23, 2023, at 10:05 PM, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> On Mon, Aug 21, 2023 at 10:06:45PM +0000, Wengang Wang wrote:
>>
>>
>>> On Aug 17, 2023, at 8:25 PM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
>>>
>>> Here is what things look like:
>>>
>>> For the file deletion, log bytes are reserved based on xfs_mount->tr_itruncate, which is:
>>> {
>>>     tr_logres   = 175488,
>>>     tr_logcount = 2,
>>>     tr_logflags = XFS_TRANS_PERM_LOG_RES,
>>> }
>>> You can see it's a permanent log reservation with two log operations (two transactions
>>> in rolling mode). After the calculation in xlog_calc_unit_res() (which adds space for
>>> various log headers), the log space needed per transaction grows from 175488 to 180208
>>> bytes, so the total log space needed is 360416 bytes (180208 * 2).
>>> That log space (360416 bytes) needs to be reserved both for run time inode removal
>>> (xfs_inactive_truncate()) and for EFI recovery (xfs_efi_item_recover()).
>>>
>>> RUNTIME
>>> ========
>>>
>>> For run time inode removal, the first transaction is mainly used for the inode field
>>> changes. The second transaction is used for intents, including extent freeing.
>>>
>>> The first transaction has 180208 reserved log bytes (another 180208 bytes are reserved
>>> for the coming transaction).
>>> The first transaction is committed, writing some bytes to the log and releasing the rest
>>> of its reserved bytes.
>>>
>>> Now we have the second transaction, which also has 180208 log bytes reserved. The second
>>> transaction is supposed to process intents, including extent freeing. With my hacking patch,
>>> I blocked the extent freeing for 5 hours. So during those 5 hours, 180208 (NOT 360416) log
>>> bytes stay reserved.
>>>
>>> In my test case, other transactions (timestamp updates) then happen. Because my hacking patch
>>> pins the journal tail, those timestamp-updating transactions eventually use up (almost) all of
>>> the remaining available log space (in memory and on disk), and the on disk (and in memory)
>>> available log space goes down to near 180208 bytes. Those 180208 bytes are the ones reserved
>>> by the second (extent-free) transaction above.
>>>
>>> Panic the kernel and remount the XFS volume.
>>>
>>> LOG RECOVER
>>> =============
>>>
>>> During log recovery, EFI recovery uses tr_itruncate again to reserve two transactions,
>>> which needs 360416 log bytes. Reserving 360416 bytes fails (blocks) because we now only
>>> have about 180208 bytes available.
>>>
>>> THINKING
>>> ========
>>> Actually, during EFI recovery we only need one transaction to free the extents, just like
>>> the 2nd transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes.
>>> We have (a bit) more than 180208 available log bytes on disk, so the reservation succeeds
>>> and the recovery proceeds. That is to say: we can fix the log recovery side to fix the
>>> issue. We could introduce a new xfs_trans_res, xfs_mount->tr_ext_free:
>>> {
>>>     tr_logres   = 175488,
>>>     tr_logcount = 0,
>>>     tr_logflags = 0,
>>> }
>>> and use tr_ext_free instead of tr_itruncate in EFI recovery. (I didn't try it.)
>>>
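Spelling the arithmetic in the quoted analysis out as a stand-alone sketch (the 4720-byte
header overhead here is just the observed difference between 175488 and 180208, not a
reimplementation of xlog_calc_unit_res()):

#include <stdio.h>

int main(void)
{
	unsigned int tr_logres   = 175488;	/* M_RES(mp)->tr_itruncate.tr_logres */
	unsigned int tr_logcount = 2;		/* permanent, two-transaction chain */
	unsigned int unit_res    = 180208;	/* tr_logres + log header space */

	printf("header overhead per transaction: %u bytes\n", unit_res - tr_logres);
	printf("reserved at xfs_trans_alloc() time: %u bytes\n", unit_res * tr_logcount);
	printf("needed for one chain step only: %u bytes\n", unit_res);
	return 0;
}

That last number (180208) is roughly what the log still has room for in the scenario above,
which is why a single-step reservation would let recovery proceed while the full two-step
reservation blocks.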
>>
>> The following patch recovers the problematic XFS volume with my hacked kernel, and also
>> recovers the one from the customer.
>>
>> commit 19fad903e213717a92f8b94fe2c0c68b6a6ee7f7 (HEAD -> 35587163_fix)
>> Author: Wengang Wang <wen.gang.wang@xxxxxxxxxx>
>> Date:   Mon Aug 21 15:03:58 2023 -0700
>>
>> xfs: reserve less log space when recovering EFIs
>>
>> Currently tr_itruncate is used for both run time truncating and
>> boot time EFI recovery.
>>
>> tr_itruncate
>> {
>>     tr_logres   = 175488,
>>     tr_logcount = 2,
>>     tr_logflags = XFS_TRANS_PERM_LOG_RES,
>> }
>>
>> is a permanent two-transaction series. Actually only the second transaction
>> is really used to free extents, and that needs half of the log space reservation
>> of tr_itruncate.
>>
>> For EFI recovery, the only thing to do is to free extents, so it doesn't
>> need the full log space reservation of tr_itruncate. It needs half of it and
>> shouldn't need more than that.
>>
>> Signed-off-by: Wengang Wang <wen.gang.wang@xxxxxxxxxx>
>>
>> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
>> index f1a5ecf099aa..428984e48d23 100644
>> --- a/fs/xfs/xfs_extfree_item.c
>> +++ b/fs/xfs/xfs_extfree_item.c
>> @@ -667,6 +667,7 @@ xfs_efi_item_recover(
>>  	int			i;
>>  	int			error = 0;
>>  	bool			requeue_only = false;
>> +	struct xfs_trans_res	tres;
>>
>>  	/*
>>  	 * First check the validity of the extents described by the
>> @@ -683,7 +684,10 @@ xfs_efi_item_recover(
>>  		}
>>  	}
>>
>> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
>> +	tres.tr_logres = M_RES(mp)->tr_itruncate.tr_logres;
>> +	tres.tr_logcount = 0;
>
> HAH that was fast.  I'm glad it worked.
>
>> +	tres.tr_logflags = 0;
>
> Though I think we should preserve XFS_TRANS_PERM_LOG_RES since other
> parts of the codebase check that the flag was conveyed from the
> transaction reservation into tp->t_flags itself.
>

Sure, I could do that. I was just wondering about the meaning of XFS_TRANS_PERM_LOG_RES.
My understanding is that it means the reserved log bytes will be used by multiple
transactions in a chain, and that tr_logcount indicates the number of transactions in
the chain. So a single transaction is not a chain (though you could view it as a
one-transaction chain), and XFS_TRANS_PERM_LOG_RES is therefore not needed. Is that a
correct understanding?

> I think for an upstream patch we'd rather fix all of them, and in a
> systemic way.  How about adding this to xfs_log_recover.h:
>
> /*
>  * Transform a regular reservation into one suitable for recovery of a
>  * log intent item.
>  *
>  * Intent recovery only runs a single step of the transaction chain and
>  * defers the rest to a separate transaction.  Therefore, we reduce
>  * logcount to 1 here to avoid livelocks if the log grant space is
>  * nearly exhausted due to the recovered intent pinning the tail.  Keep
>  * the same logflags to avoid tripping asserts elsewhere.  Struct copies
>  * abound below.
>  */
> static inline struct xfs_trans_res
> xlog_recover_resv(const struct xfs_trans_res *template)
> {
> 	struct xfs_trans_res ret = {
> 		.tr_logres	= template->tr_logres,
> 		.tr_logcount	= 1,
> 		.tr_logflags	= template->tr_logflags,
> 	};
>
> 	return ret;
> }
>
> and then this becomes:
>
> 	struct xfs_trans_res	resv;
>
> 	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
> 	error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp);
>
> which should simplify fixing this for the other recovery functions?
>

Yes, I could do that.
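Just to check that I read the suggestion right: with that helper in xfs_log_recover.h, the
hunk in my patch above would presumably shrink to something like this (untested sketch, only
rearranging the pieces already quoted in this thread):

	struct xfs_trans_res	resv;

	/*
	 * One chain step is enough during recovery; tr_logflags (and thus
	 * XFS_TRANS_PERM_LOG_RES) is carried over from tr_itruncate.
	 */
	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
	error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp);
	if (error)
		return error;

and the same pattern would then drop into the other intent recovery functions.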
thanks,
wengang

> --D
>
>> +	error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp);
>
>>  	if (error)
>>  		return error;
>>  	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
>>
>> thanks,
>> wengang
>>
>>> thanks,
>>> wengang
>>>
>>>> On Jul 28, 2023, at 10:56 AM, Wengang Wang <wen.gang.wang@xxxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>>> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote:
>>>>>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>>>>>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote:
>>>>>>>> FYI:
>>>>>>>>
>>>>>>>> I am able to reproduce the XFS mount hang issue with hacked kernels based on
>>>>>>>> both the 4.14.35 and the 6.4.0 kernel.
>>>>>>>> Reproduce steps:
>>>>>>>>
>>>>>>>> 1. Create an XFS filesystem with a 10 MiB log (small, so it's easier to reproduce).
>>>>>>>> The following steps all target this XFS volume.
>>>>>>>
>>>>>>> Actually, make that a few milliseconds.... :)
>>>>>>
>>>>>> :)
>>>>>>
>>>>>>> mkfs/xfs_info output would be appreciated.
>>>>>>
>>>>>> sure,
>>>>>> # xfs_info 20GB.bk2
>>>>>> meta-data=20GB.bk2          isize=256    agcount=4, agsize=1310720 blks
>>>>>>          =                  sectsz=512   attr=2, projid32bit=1
>>>>>>          =                  crc=0        finobt=0, sparse=0, rmapbt=0
>>>>>>          =                  reflink=0
>>>>>
>>>>> Hmmmm. Why are you only testing v4 filesystems? They are deprecated
>>>>> and support is largely due to be dropped from upstream in 2025...
>>>>>
>>>>
>>>> Ha, it just happened to be so.
>>>> I was trying to reproduce it in the same environment as the customer has,
>>>> that is, OracleLinux7. The default behavior of mkfs.xfs on OL7 is to format
>>>> v4 filesystems. I created the XFS image in a file on OL7 and copied the image
>>>> file to a 6.4.0 kernel machine. That's why you see a v4 filesystem here.
>>>>
>>>>> Does the same problem occur with v5 filesystems?
>>>>
>>>> Will test and report back.
>>>>
>>>>>
>>>>>>>> 5. Checking the on disk free log space left, it's 181760 bytes for both the 4.14.35
>>>>>>>> kernel and the 6.4.0 kernel.
>>>>>>>
>>>>>>> Which is clearly wrong. It should be at least 360416 bytes (i.e.
>>>>>>> tr_itrunc), because that's what the EFI being processed, which pins
>>>>>>> the tail of the log, is supposed to have reserved when it was
>>>>>>> stalled.
>>>>>>
>>>>>> Yep, exactly.
>>>>>>
>>>>>>> So where has the ~180kB of leaked space come from?
>>>>>>>
>>>>>>> Have you traced the grant head reservations to find out
>>>>>>> what the runtime log space and grant head reservations actually are?
>>>>>> I have the numbers in the vmcore (ignore the WARNs),
>>>>>
>>>>> That's not what I'm asking. You've dumped the values at the time of
>>>>> the hang, not traced the runtime reservations that have been made.
>>>>>
>>>>>>> i.e. we have full tracing of the log reservation accounting via
>>>>>>> tracepoints in the kernel. If there is a leak occurring, you need to
>>>>>>> capture a trace of all the reservation accounting operations and
>>>>>>> post process the output to find out what operation is leaking
>>>>>>> reserved space. e.g.
>>>>>>>
>>>>>>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo
>>>>>>> ....
>>>>>>> # trace-cmd report > s.t
>>>>>>> # head -3 s.t
>>>>>>> cpus=16
>>>>>>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
>>>>>>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2
>>>>>>>
>>>>>>> #
>>>>>>>
>>>>>>> So this tells us the transaction reservation unit size, the count of
>>>>>>> reservations, the current reserve and grant head locations, and the
>>>>>>> current head and tail of the log at the time the transaction
>>>>>>> reservation is started and then after it completes.
>>>>>>
>>>>>> Will do that and report back. Do you want the full log or only some typical
>>>>>> entries? The full log would be big; how shall I share it?
>>>>>
>>>>> I don't want to see the log. It'll be huge - I regularly generate
>>>>> traces containing gigabytes of log accounting traces like this from
>>>>> a single workload.
>>>>>
>>>>> What I'm asking you to do is run the tracing and then post process
>>>>> the values from the trace to determine what operation is using more
>>>>> space than is being freed back to the log.
>>>>>
>>>>> I generally do this with grep, awk and sed. Some people use python
>>>>> or perl. But either way it's a *lot* of work - in the past I have
>>>>> spent _weeks_ on trace analysis to find a 4 byte leak in the log
>>>>> space accounting. Doing things like graphing the head, tail and grant
>>>>> spaces over time tends to show whether this is a gradual leak or a
>>>>> sudden step change. If it's a sudden step change, then you can
>>>>> isolate it in the trace and work out what happened. If it's a
>>>>> gradual change, then you need to start looking for accounting
>>>>> discrepancies...
>>>>>
>>>>> e.g. a transaction records 32 bytes used in the item, so it releases
>>>>> t_unit - 32 bytes at commit. However, the CIL may then only track 28
>>>>> bytes of space for the item in the journal, and we leak 4 bytes of
>>>>> reservation on every one of those items committed.
>>>>>
>>>>> These sorts of leaks typically only add up to something
>>>>> significant in situations where the log is flooded with tiny inode
>>>>> timestamp changes - 4 bytes per item doesn't really matter when you
>>>>> only have a few thousand items in the log, but when you have
>>>>> hundreds of thousands of tiny items in the log...
>>>>>
>>>>
>>>> OK, I will work more on this.
>>>> # I am going to start a two-week vacation, and will continue on this when I'm back.
>>>>
>>>> thanks,
>>>> wengang
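As a starting point for that post-processing, here is a minimal sketch of the kind of filter
Dave describes (he uses grep/awk/sed; this is just a C equivalent I would build on, nothing
from the tree). It pulls grant_reserve_bytes out of each trace line (the field name is taken
from the xfs_log_reserve lines quoted above) and prints the delta from the previous sample,
so sudden step changes in the reserve grant head stand out. Cycle wrap and the write head
are deliberately ignored.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	char line[4096];
	long prev = -1;

	/* feed "trace-cmd report" output on stdin */
	while (fgets(line, sizeof(line), stdin)) {
		char *p = strstr(line, "grant_reserve_bytes ");
		long cur;

		if (!p)
			continue;
		cur = strtol(p + strlen("grant_reserve_bytes "), NULL, 10);

		/* how far did the reserve grant head move for this event? */
		if (prev >= 0 && cur != prev)
			printf("%+ld\t%s", cur - prev, line);
		prev = cur;
	}
	return 0;
}

On the two example lines quoted above it would print +481776, which is exactly
t_unit_res * t_cnt (240888 * 2) for that permanent reservation.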