Hi, Dave Thanks for your reply. On Mon, Jan 08, 2024 at 09:13:50AM +1100, Dave Chinner wrote: > On Thu, Dec 28, 2023 at 08:46:46PM +0800, Long Li wrote: > > While performing the IO fault injection test, I caught the following data > > corruption report: > > > > XFS (dm-0): Internal error ltbno + ltlen > bno at line 1957 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_ag_extent+0x79c/0x1130 > > CPU: 3 PID: 33 Comm: kworker/3:0 Not tainted 6.5.0-rc7-next-20230825-00001-g7f8666926889 #214 > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 > > Workqueue: xfs-inodegc/dm-0 xfs_inodegc_worker > > Call Trace: > > <TASK> > > dump_stack_lvl+0x50/0x70 > > xfs_corruption_error+0x134/0x150 > > xfs_free_ag_extent+0x7d3/0x1130 > > __xfs_free_extent+0x201/0x3c0 > > xfs_trans_free_extent+0x29b/0xa10 > > xfs_extent_free_finish_item+0x2a/0xb0 > > xfs_defer_finish_noroll+0x8d1/0x1b40 > > xfs_defer_finish+0x21/0x200 > > xfs_itruncate_extents_flags+0x1cb/0x650 > > xfs_free_eofblocks+0x18f/0x250 > > xfs_inactive+0x485/0x570 > > xfs_inodegc_worker+0x207/0x530 > > process_scheduled_works+0x24a/0xe10 > > worker_thread+0x5ac/0xc60 > > kthread+0x2cd/0x3c0 > > ret_from_fork+0x4a/0x80 > > ret_from_fork_asm+0x11/0x20 > > </TASK> > > XFS (dm-0): Corruption detected. Unmount and run xfs_repair > > > > After analyzing the disk image, it was found that the corruption was > > triggered by the fact that extent was recorded in both the inode and AGF > > btrees. After a long time of reproduction and analysis, we found that the > > root cause of the problem was that the AGF btree block was not recovered. > > Why was it not recovered? Because of an injected IO error during > recovery? The reason why the buf item of AGF btree is not recovery is that the LSN of AGF btree block equal to the current LSN of the recovery item, Because log recovery skips items with a metadata LSN >= the current LSN of the recovery item. Injected IO error during recovery cause that the LSN of AGF btree block equal to the current LSN of the recovery item, that's happend in the situation of two transaction on disk share same LSN and both modify the same buffer. Detailed information can be found below. > > > Consider the following situation, Transaction A and Transaction B are in > > the same record, so Transaction A and Transaction B share the same LSN1. > > If the buf item in Transaction A has been recovered, then the buf item in > > Transaction B cannot be recovered, because log recovery skips items with a > > metadata LSN >= the current LSN of the recovery item. > > This makes no sense to me. Transactions don't exist in the journal; > they are purely in-memory constructs that are aggregated > in memory (in the CIL) before being written to disk as an atomic > checkpoint. Hence a log item can only appear once in a checkpoint > regardless of how many transactions it is modified in memory between > CIL checkpoints. > > > If there is still an > > inode item in transaction B that records the Extent X, the Extent X will > > be recorded in both the inode and the AGF btree block after transaction B > > is recovered. > > That transaction should record both the addition to the inode BMBT > and the removal from the AGF. Hence if transaction B is recovered in > full with no errors, this should not occur. > > > > > |------------Record (LSN1)------------------|---Record (LSN2)---| > > |----------Trans A------------|-------------Trans B-------------| > > | Buf Item(Extent X) | Buf Item / Inode item(Extent X) | > > | Extent X is freed | Extent X is allocated | > > This looks wrong. A transaction can only exist in a single CIL > checkpoint and everything in a checkpoint has the same LSN. Hence we > cannot have the situation where trans B spans two different > checkpoints and hence span LSNs. There is some misunderstanding here. Transactions that I said is on disk, not in memory. Each transaction on disk corresponds to a checkpoint(This is my understanding, or we can call it as checkpoint transaction just like <<XFS Algorithms & Data Structures>>), The two are easily confused, and their meanings are not the same. The transaction on disk can spans two different record. The following logs show the details: //Trans A, tid d0bfef23 //Trans B, tid 9a76bd30 ============================================================================ cycle: 271 version: 2 lsn: 271,14642 tail_lsn: 271,12644 length of Log Record: 32256 prev offset: 14608 num ops: 249 uuid: 01ce1afc-cedd-4120-8d8d-05fbee260af9 format: little endian linux h_size: 32768 ---------------------------------------------------------------------------- Oper (0): tid: d0bfef23 len: 0 clientid: TRANS flags: START ---------------------------------------------------------------------------- Oper (1): tid: d0bfef23 len: 16 clientid: TRANS flags: none TRAN: tid: d0bfef23 num_items: 145 ---------------------------------------------------------------------------- ...... ---------------------------------------------------------------------------- Oper (5): tid: d0bfef23 len: 56 clientid: TRANS flags: none INODE: #regs: 3 ino: 0xc4aa3 flags: 0x41 dsize: 0 blkno: 805536 len: 32 boff: 1536 Oper (6): tid: d0bfef23 len: 176 clientid: TRANS flags: none INODE CORE magic 0x494e mode 0100666 version 3 format 2 nlink 1 uid 0 gid 0 atime 0xd014ae00 mtime 0xfe187700 ctime 0xfe187700 size 0x0 nblocks 0x0 extsize 0x0 nextents 0x0 /* inode 0xc4aa3 nblocks is 0 */ naextents 0x0 forkoff 35 dmevmask 0x0 dmstate 0x0 flags 0x0 gen 0x526bda01 flags2 0x8 cowextsize 0x0 Oper (7): tid: d0bfef23 len: 52 clientid: TRANS flags: none LOCAL attr data ---------------------------------------------------------------------------- ...... ---------------------------------------------------------------------------- Oper (102): tid: d0bfef23 len: 24 clientid: TRANS flags: none BUF: #regs: 2 start blkno: 1048577 (0x100001) len: 1 bmap size: 1 flags: 0x2800 Oper (103): tid: d0bfef23 len: 128 clientid: TRANS flags: none AGF Buffer: XAGF ver: 1 seq#: 2 len: 65536 root BNO: 3 CNT: 4 level BNO: 1 CNT: 1 1st: 110 last: 113 cnt: 4 freeblks: 40923 longest: 37466 ---------------------------------------------------------------------------- Oper (104): tid: d0bfef23 len: 24 clientid: TRANS flags: none BUF: #regs: 2 start blkno: 1048600 (0x100018) len: 8 bmap size: 1 flags: 0x2000 Oper (105): tid: d0bfef23 len: 768 clientid: TRANS flags: none BUF DATA 0 42334241 4a000000 ffffffff ffffffff 0 18001000 f010000 ff300000 8 fc1ace01 2041ddce fb058d8d f90a26ee 2000000 128ec3ad 50a0000 5d000000 10 770a0000 21000000 a00a0000 b5000000 580b0000 2d010000 fa0c0000 10000000 /* extent (770a0000 21000000) recorded in the AGF */ 18 170d0000 b000000 3e0d0000 1a000000 710d0000 5000000 a60d0000 3000000 20 ee0d0000 91000000 bf0e0000 1000000 c80e0000 3000000 d60e0000 62010000 28 47100000 a000000 68100000 6000000 89500000 a0000000 94510000 4000000 30 c2510000 60000000 3e520000 31000000 8a520000 1e000000 b0520000 1000000 38 ce520000 1e000000 ee520000 11000000 8530000 18000000 14540000 23000000 40 70540000 2c000000 a3540000 19000000 d2540000 13000000 9550000 19000000 48 35550000 2000000 61550000 e000000 81550000 11000000 a0550000 4000000 50 b6550000 6000000 d4550000 5000000 df550000 3000000 d560000 8000000 58 24560000 5000000 7b560000 2000000 c8560000 10000000 df560000 36000000 60 74570000 20000000 a2570000 5d000000 16580000 13000000 46580000 3a000000 68 91580000 9000000 3a590000 1000000 6a590000 2000000 5a0000 15000000 70 525a0000 35000000 e85a0000 1000000 d95b0000 2000000 615d0000 28000000 78 485e0000 d3000000 2a5f0000 9000000 495f0000 44000000 c15f0000 a9000000 80 cd600000 c000000 34610000 4000000 99610000 1000000 b9610000 1c000000 88 dc610000 2000000 e2610000 84000000 77620000 1f000000 95630000 8c000000 90 77640000 13000000 73650000 71000000 7a690000 5000000 8e690000 6000000 98 576a0000 26000000 376b0000 3a000000 ce6b0000 1e000000 66c0000 a7000000 a0 a66d0000 5a920000 336d0000 cd920000 336d0000 cd920000 336d0000 cd920000 a8 0 0 0 0 0 0 0 0 b0 0 0 0 0 0 0 0 0 b8 0 0 0 0 0 0 0 0 ---------------------------------------------------------------------------- Oper (106): tid: d0bfef23 len: 24 clientid: TRANS flags: none BUF: #regs: 2 start blkno: 1048608 (0x100020) len: 8 bmap size: 1 flags: 0x2000 Oper (107): tid: d0bfef23 len: 768 clientid: TRANS flags: none BUF DATA 0 43334241 4a000000 ffffffff ffffffff 0 20001000 f010000 ff300000 8 fc1ace01 2041ddce fb058d8d f90a26ee 2000000 3ea37604 bf0e0000 1000000 10 b0520000 1000000 3a590000 1000000 e85a0000 1000000 99610000 1000000 18 35550000 2000000 7b560000 2000000 6a590000 2000000 d95b0000 2000000 20 dc610000 2000000 a60d0000 3000000 c80e0000 3000000 df550000 3000000 28 94510000 4000000 a0550000 4000000 34610000 4000000 710d0000 5000000 30 d4550000 5000000 24560000 5000000 7a690000 5000000 68100000 6000000 38 b6550000 6000000 8e690000 6000000 d560000 8000000 91580000 9000000 40 2a5f0000 9000000 47100000 a000000 170d0000 b000000 cd600000 c000000 48 61550000 e000000 fa0c0000 10000000 c8560000 10000000 ee520000 11000000 50 81550000 11000000 d2540000 13000000 16580000 13000000 77640000 13000000 58 5a0000 15000000 8530000 18000000 a3540000 19000000 9550000 19000000 60 3e0d0000 1a000000 b9610000 1c000000 8a520000 1e000000 ce520000 1e000000 68 ce6b0000 1e000000 77620000 1f000000 74570000 20000000 770a0000 21000000 70 14540000 23000000 576a0000 26000000 615d0000 28000000 70540000 2c000000 78 3e520000 31000000 525a0000 35000000 df560000 36000000 46580000 3a000000 80 376b0000 3a000000 495f0000 44000000 50a0000 5d000000 a2570000 5d000000 88 c2510000 60000000 73650000 71000000 e2610000 84000000 95630000 8c000000 90 ee0d0000 91000000 89500000 a0000000 66c0000 a7000000 c15f0000 a9000000 98 a00a0000 b5000000 485e0000 d3000000 580b0000 2d010000 d60e0000 62010000 a0 a66d0000 5a920000 336d0000 cd920000 336d0000 cd920000 336d0000 cd920000 a8 0 0 0 0 0 0 0 0 b0 0 0 0 0 0 0 0 0 b8 0 0 0 0 0 0 0 0 ---------------------------------------------------------------------------- ...... ---------------------------------------------------------------------------- Oper (147): tid: d0bfef23 len: 0 clientid: TRANS flags: COMMIT ---------------------------------------------------------------------------- Oper (148): tid: 9a76bd30 len: 0 clientid: TRANS flags: START ---------------------------------------------------------------------------- Oper (149): tid: 9a76bd30 len: 16 clientid: TRANS flags: none TRAN: tid: 9a76bd30 num_items: 164 ---------------------------------------------------------------------------- ...... ---------------------------------------------------------------------------- Oper (248): tid: 9a76bd30 len: 168 clientid: TRANS flags: CONTINUE ============================================================================ cycle: 271 version: 2 lsn: 271,14706 tail_lsn: 271,12644 length of Log Record: 10752 prev offset: 14642 num ops: 67 uuid: 01ce1afc-cedd-4120-8d8d-05fbee260af9 format: little endian linux h_size: 32768 ---------------------------------------------------------------------------- Oper (0): tid: 9a76bd30 len: 8 clientid: TRANS flags: WAS_CONT END Left over region from split log item ---------------------------------------------------------------------------- Oper (1): tid: 9a76bd30 len: 52 clientid: TRANS flags: none Left over region from split log item ---------------------------------------------------------------------------- ...... ---------------------------------------------------------------------------- Oper (5): tid: 9a76bd30 len: 56 clientid: TRANS flags: none INODE: #regs: 4 ino: 0xc4aa3 flags: 0x45 dsize: 32 blkno: 805536 len: 32 boff: 1536 Oper (6): tid: 9a76bd30 len: 176 clientid: TRANS flags: none INODE CORE magic 0x494e mode 0100666 version 3 format 2 nlink 0 uid 0 gid 0 atime 0xf4e300 mtime 0x2261000 ctime 0x2a02200 size 0x1a9600 nblocks 0x3f extsize 0x0 nextents 0x2 /* inode 0xc4aa3 nblocks is 0x3f */ naextents 0x0 forkoff 24 dmevmask 0x0 dmstate 0x0 flags 0x0 gen 0x526bda01 flags2 0x8 cowextsize 0x0 Oper (7): tid: 9a76bd30 len: 32 clientid: TRANS flags: none EXTENTS inode data Oper (8): tid: 9a76bd30 len: 144 clientid: TRANS flags: none LOCAL attr data ---------------------------------------------------------------------------- ...... ---------------------------------------------------------------------------- Oper (57): tid: 9a76bd30 len: 24 clientid: TRANS flags: none BUF: #regs: 2 start blkno: 1048577 (0x100001) len: 1 bmap size: 1 flags: 0x2800 Oper (58): tid: 9a76bd30 len: 128 clientid: TRANS flags: none AGF Buffer: XAGF ver: 1 seq#: 2 len: 65536 root BNO: 3 CNT: 4 level BNO: 1 CNT: 1 1st: 114 last: 117 cnt: 4 freeblks: 40714 longest: 37466 ---------------------------------------------------------------------------- Oper (59): tid: 9a76bd30 len: 24 clientid: TRANS flags: none BUF: #regs: 2 start blkno: 1048608 (0x100020) len: 8 bmap size: 1 flags: 0x2000 Oper (60): tid: 9a76bd30 len: 768 clientid: TRANS flags: none BUF DATA 0 43334241 47000000 ffffffff ffffffff 0 20001000 f010000 ff300000 8 fc1ace01 2041ddce fb058d8d f90a26ee 2000000 3ea37604 b0520000 1000000 10 3a590000 1000000 e85a0000 1000000 99610000 1000000 dc610000 1000000 18 35550000 2000000 7b560000 2000000 6a590000 2000000 d95b0000 2000000 20 a60d0000 3000000 c80e0000 3000000 df550000 3000000 34610000 4000000 28 710d0000 5000000 d4550000 5000000 24560000 5000000 7a690000 5000000 30 68100000 6000000 b6550000 6000000 8e690000 6000000 d560000 8000000 38 91580000 9000000 2a5f0000 9000000 5d620000 9000000 47100000 a000000 40 170d0000 b000000 cd600000 c000000 580b0000 d000000 61550000 e000000 48 fa0c0000 10000000 d5540000 10000000 c8560000 10000000 ee520000 11000000 50 81550000 11000000 16580000 13000000 77640000 13000000 5a0000 15000000 58 8530000 18000000 a3540000 19000000 9550000 19000000 3e0d0000 1a000000 60 b9610000 1c000000 8a520000 1e000000 ce520000 1e000000 ce6b0000 1e000000 68 77620000 1f000000 74570000 20000000 14540000 23000000 576a0000 26000000 70 615d0000 28000000 70540000 2c000000 3e520000 31000000 525a0000 35000000 78 df560000 36000000 46580000 3a000000 376b0000 3a000000 495f0000 44000000 80 c6510000 5c000000 50a0000 5d000000 a2570000 5d000000 73650000 71000000 88 95630000 8c000000 ee0d0000 91000000 89500000 a0000000 66c0000 a7000000 90 c15f0000 a9000000 a00a0000 b5000000 485e0000 d3000000 830b0000 2010000 98 dc0e0000 5c010000 a66d0000 5a920000 a66d0000 5a920000 a66d0000 5a920000 a0 a66d0000 5a920000 336d0000 cd920000 336d0000 cd920000 336d0000 cd920000 a8 0 0 0 0 0 0 0 0 b0 0 0 0 0 0 0 0 0 b8 0 0 0 0 0 0 0 0 ---------------------------------------------------------------------------- Oper (61): tid: 9a76bd30 len: 24 clientid: TRANS flags: none BUF: #regs: 2 start blkno: 1048600 (0x100018) len: 8 bmap size: 1 flags: 0x2000 Oper (62): tid: 9a76bd30 len: 768 clientid: TRANS flags: none BUF DATA 0 42334241 47000000 ffffffff ffffffff 0 18001000 f010000 ff300000 8 fc1ace01 2041ddce fb058d8d f90a26ee 2000000 128ec3ad 50a0000 5d000000 10 a00a0000 b5000000 580b0000 d000000 830b0000 2010000 fa0c0000 10000000 /* extent (770a0000 21000000) has been removed from the AGF */ 18 170d0000 b000000 3e0d0000 1a000000 710d0000 5000000 a60d0000 3000000 20 ee0d0000 91000000 c80e0000 3000000 dc0e0000 5c010000 47100000 a000000 28 68100000 6000000 89500000 a0000000 c6510000 5c000000 3e520000 31000000 30 8a520000 1e000000 b0520000 1000000 ce520000 1e000000 ee520000 11000000 38 8530000 18000000 14540000 23000000 70540000 2c000000 a3540000 19000000 40 d5540000 10000000 9550000 19000000 35550000 2000000 61550000 e000000 48 81550000 11000000 b6550000 6000000 d4550000 5000000 df550000 3000000 50 d560000 8000000 24560000 5000000 7b560000 2000000 c8560000 10000000 58 df560000 36000000 74570000 20000000 a2570000 5d000000 16580000 13000000 60 46580000 3a000000 91580000 9000000 3a590000 1000000 6a590000 2000000 68 5a0000 15000000 525a0000 35000000 e85a0000 1000000 d95b0000 2000000 70 615d0000 28000000 485e0000 d3000000 2a5f0000 9000000 495f0000 44000000 78 c15f0000 a9000000 cd600000 c000000 34610000 4000000 99610000 1000000 80 b9610000 1c000000 dc610000 1000000 5d620000 9000000 77620000 1f000000 88 95630000 8c000000 77640000 13000000 73650000 71000000 7a690000 5000000 90 8e690000 6000000 576a0000 26000000 376b0000 3a000000 ce6b0000 1e000000 98 66c0000 a7000000 a66d0000 5a920000 a66d0000 5a920000 a66d0000 5a920000 a0 a66d0000 5a920000 336d0000 cd920000 336d0000 cd920000 336d0000 cd920000 a8 0 0 0 0 0 0 0 0 b0 0 0 0 0 0 0 0 0 b8 0 0 0 0 0 0 0 0 ---------------------------------------------------------------------------- ...... ---------------------------------------------------------------------------- Oper (66): tid: 9a76bd30 len: 0 clientid: TRANS flags: COMMIT > > These are valid representations: > > |------------Record (LSN1)----|-----------------Record (LSN2)---| > |----------Trans A------------|-------------Trans B-------------| > > |------------Record (LSN1)--------------------------------------| > |----------Trans A------------|-------------Trans B-------------| > > |-----------------------------------------------Record (LSN2)---| > |----------Trans A------------|-------------Trans B-------------| > > Only in the first case are there two instances of the AGF buf item > object in the journal (one in each checkpoint). In the latter two > cases, there is only one copy of the AGF buf log item that contains > extent X. Indeed, it will not contain extent X, because the CIL > aggregation results in the addition in trans A being elided by the > removal in trans B, essentially resulting in the buffer being > unchanged except for the LSN after recovery. > > As such, I'm really not sure what you are trying to describe here - > if recovery of the checkpoint at LSN1 fails in any way, we should > never attempt to recovery the checkpoint at LSN2. If LSN1 recoveres > entirely successfully, then LSN2 should see the correct state and > recover appropriately, too. Hence I don't see how the situation you > are describing arises. > > > After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on > > LSN boundaries") was introduced, we submit buffers on lsn boundaries during > > log recovery. > > Correct - we submit all the changes in a checkpoint for submission > before we start recovering the next checkpoint. That's because > checkpoints are supposed to be atomic units of change moving the > on-disk state from one change set to the next. Submit buffer on LSN boundaries not means submit buffer on checkpoint boundaries during recovery. In my understanding, One transaction on disk corresponds to a checkpoint, there's maybe multiple transaction on disk share same LSN, so sometimes we should ensure that submit multiple transation one time in such case. This rule was introduced by commit 12818d24db8a ("xfs: rework log recovery to submit buffers on LSN boundaries") Now, we can ensure this rule in most case, but we violate the rule when submitting buffer regardless of error in xlog_do_recovery_pass(), this pacth try to fix the problem. If I misunderstand, expect you to point it out. :) Thanks, Long Li > > If any error during processing of a checkpoint occurs, we are > supposed to abort recovery at that checkpoint so we don't create a > situation where future recovery attempts skip checkpoints that need > to be recovered. > > It does not matter if we write back the modified buffers from > partially completed checkpoints - they were successfully recovered > in their entirity, and so it is safe to write them back knowing that > the next attempt to recover the failed checkpoint will see a > matching LSN and skip that buffer item. If writeback fails, then it > just doesn't matter as the next recovery attempt will see the old > LSN and recover that buf item again and write it back.... > > AFAICT, you're describing things working as they are supposed to, > and I don't see where the problem you are attempting to fix is yet. > > > The above problem can be avoided under normal paths, but it's > > not guaranteed under abnormal paths. Consider the following process, if an > > error was encountered after recover buf item in transaction A and before > > recover buf item in transaction B, buffers that have been added to > > buffer_list will still be submitted, this violates the submits rule on lsn > > boundaries. So buf item in Transaction B cannot be recovered on the next > > mount due to current lsn of transaction equal to metadata lsn on disk. > > > > xlog_do_recovery_pass > > error = xlog_recover_process > > xlog_recover_process_data > > ... > > xlog_recover_buf_commit_pass2 > > xlog_recover_do_reg_buffer //recover buf item in Trans A > > xfs_buf_delwri_queue(bp, buffer_list) > > ... > > ====> Encountered error and returned > > What error is this, and why isn't it a fatal error causing the > checkpoint recovery to be aborted and the delwri list to be canceled? > > > ... > > xlog_recover_buf_commit_pass2 > > xlog_recover_do_reg_buffer //recover buf item in Trans B > > xfs_buf_delwri_queue(bp, buffer_list) > > This should never occur as we should be aborting log recovery on the > first failure, not continuing to process the checkpoint or starting > to process other checkpoints. Where are we failing to handle an > error? > > > if (!list_empty(&buffer_list)) > > xfs_buf_delwri_submit(&buffer_list); //submit regardless of error > > Yes, that's fine (as per above). Indeed, this is how we handle > releasing the buffer log item on failure - this goes through IO > completion and that release the buf log item we added to the buffer > during recovery for LSN stamping. > > > In order to make sure that submits buffers on lsn boundaries in the > > abnormal paths, we need to check error status before submit buffers that > > have been added from the last record processed. If error status exist, > > buffers in the bufffer_list should be canceled. > > No, it does not need to be cancelled, it just needs to be processed. > Anything we've fully recovered is safe to write - it's no different > from having the system crash during AIL writeback having written > back these buffers and having to recover from part way through this > checkpoint. > > > Canceling the buffers in the buffer_list directly isn't correct, unlike > > any other place where write list was canceled, these buffers has been > > initialized by xfs_buf_item_init() during recovery and held by buf > > item, buf items will not be released in xfs_buf_delwri_cancel(). If > > these buffers are submitted successfully, buf items assocated with > > the buffer will be released in io end process. So releasing buf item > > in write list cacneling process is needed. > > Yes, that's why we use xfs_buf_delwri_submit() even on error - it > handles releasing the buffer log item in IO completion handling > (even on error). > > > > > Fixes: 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery") > > Signed-off-by: Long Li <leo.lilong@xxxxxxxxxx> > > --- > > fs/xfs/xfs_buf.c | 2 ++ > > fs/xfs/xfs_log_recover.c | 22 +++++++++++++--------- > > 2 files changed, 15 insertions(+), 9 deletions(-) > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c > > index 8e5bd50d29fe..6a1b26aaf97e 100644 > > --- a/fs/xfs/xfs_buf.c > > +++ b/fs/xfs/xfs_buf.c > > @@ -2075,6 +2075,8 @@ xfs_buf_delwri_cancel( > > xfs_buf_lock(bp); > > bp->b_flags &= ~_XBF_DELWRI_Q; > > xfs_buf_list_del(bp); > > + if (bp->b_log_item) > > + xfs_buf_item_relse(bp); > > xfs_buf_relse(bp); > > } > > } > > I don't think this is a good idea - the delwri does not own the > buf log item reference, and so this will cause problems with > anything that already handles buf log item references correctly. > > > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c > > index 1251c81e55f9..2cda6c90890d 100644 > > --- a/fs/xfs/xfs_log_recover.c > > +++ b/fs/xfs/xfs_log_recover.c > > @@ -2964,7 +2964,6 @@ xlog_do_recovery_pass( > > char *offset; > > char *hbp, *dbp; > > int error = 0, h_size, h_len; > > - int error2 = 0; > > int bblks, split_bblks; > > int hblks, split_hblks, wrapped_hblks; > > int i; > > @@ -3203,16 +3202,21 @@ xlog_do_recovery_pass( > > bread_err1: > > kmem_free(hbp); > > > > - /* > > - * Submit buffers that have been added from the last record processed, > > - * regardless of error status. > > - */ > > - if (!list_empty(&buffer_list)) > > - error2 = xfs_buf_delwri_submit(&buffer_list); > > - > > if (error && first_bad) > > *first_bad = rhead_blk; > > > > + /* > > + * If there are no error, submit buffers that have been added from the > > + * last record processed, othrewise cancel the write list, to ensure > > + * submit buffers on LSN boundaries. > > + */ > > + if (!list_empty(&buffer_list)) { > > + if (error) > > + xfs_buf_delwri_cancel(&buffer_list); > > + else > > + error = xfs_buf_delwri_submit(&buffer_list); > > + } > > > + > > /* > > * Transactions are freed at commit time but transactions without commit > > * records on disk are never committed. Free any that may be left in the > > @@ -3226,7 +3230,7 @@ xlog_do_recovery_pass( > > xlog_recover_free_trans(trans); > > } > > > > - return error ? error : error2; > > + return error; > > } > > > > /* > > -- > > 2.31.1 > > > > > > > > -- > Dave Chinner > david@xxxxxxxxxxxxx