On Mon, Jul 17, 2023 at 10:02:41AM +0800, Wu Guanghao wrote:
> 
> 
> On 2023/7/13 13:20, Darrick J. Wong wrote:
> > On Wed, Jun 28, 2023 at 10:36:09AM +0800, Wu Guanghao wrote:
> >>
> >>
> >> On 2023/6/10 10:44, Darrick J. Wong wrote:
> >>> On Fri, Jun 09, 2023 at 09:08:01AM +0800, Wu Guanghao wrote:
> >>>> We found an issue where repair failed during fault injection.
> >>>>
> >>>> $ xfs_repair test.img
> >>>> ...
> >>>> Phase 3 - for each AG...
> >>>>         - scan and clear agi unlinked lists...
> >>>>         - process known inodes and perform inode discovery...
> >>>>         - agno = 0
> >>>>         - agno = 1
> >>>>         - agno = 2
> >>>> Metadata CRC error detected at 0x55a30e420c7d, xfs_bmbt block 0x51d68/0x1000
> >>>>         - agno = 3
> >>>> Metadata CRC error detected at 0x55a30e420c7d, xfs_bmbt block 0x51d68/0x1000
> >>>> btree block 0/41901 is suspect, error -74
> >>>> bad magic # 0x58534c4d in inode 3306572 (data fork) bmbt block 41901
> >>>> bad data fork in inode 3306572
> >>>> cleared inode 3306572
> >>>> ...
> >>>> Phase 7 - verify and correct link counts...
> >>>> Metadata corruption detected at 0x55a30e420b58, xfs_bmbt block 0x51d68/0x1000
> >>>> libxfs_bwrite: write verifier failed on xfs_bmbt bno 0x51d68/0x8
> >>>> xfs_repair: Releasing dirty buffer to free list!
> >>>> xfs_repair: Refusing to write a corrupt buffer to the data device!
> >>>> xfs_repair: Lost a write to the data device!
> >>>>
> >>>> fatal error -- File system metadata writeout failed, err=117.  Re-run xfs_repair.
> >>>>
> >>>>
> >>>> $ xfs_db test.img
> >>>> xfs_db> inode 3306572
> >>>> xfs_db> p
> >>>> core.magic = 0x494e
> >>>> core.mode = 0100666          // regular file
> >>>> core.version = 3
> >>>> core.format = 3 (btree)
> >>>> ...
> >>>> u3.bmbt.keys[1] = [startoff] 1:[6]
> >>>> u3.bmbt.ptrs[1] = 41901      // btree root
> >>>> ...
> >>>>
> >>>> $ hexdump -C -n 4096 41901.img
> >>>> 00000000  58 53 4c 4d 00 00 00 00  00 00 01 e8 d6 f4 03 14  |XSLM............|
> >>>> 00000010  09 f3 a6 1b 0a 3c 45 5a  96 39 41 ac 09 2f 66 99  |.....<EZ.9A../f.|
> >>>> 00000020  00 00 00 00 00 05 1f fb  00 00 00 00 00 05 1d 68  |...............h|
> >>>> ...
> >>>>
> >>>> The block data associated with inode 3306572 is corrupt, but the CRC is
> >>>> checked first when the block is read: if the CRC check fails, badcrc is
> >>>> set on the buffer, and the dirty flag is then set on the bp because
> >>>> badcrc is set.  In the final stage of repair, the dirty bps are flushed
> >>>> to disk in batches, and the data in each buffer is verified again at
> >>>> writeback time.  If that verification fails, repair aborts with the
> >>>> error shown above.
> >>>>
> >>>> After scan_bmapbt() returns an error, the inode will be cleared anyway,
> >>>> so the bp does not need the dirty flag; that way it cannot trigger a
> >>>> writeback verification failure.
> >>>>
> >>>> Signed-off-by: Wu Guanghao <wuguanghao3@xxxxxxxxxx>
> >>>> ---
> >>>>  repair/scan.c | 2 +-
> >>>>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/repair/scan.c b/repair/scan.c
> >>>> index 7b720131..b5458eb8 100644
> >>>> --- a/repair/scan.c
> >>>> +++ b/repair/scan.c
> >>>> @@ -185,7 +185,7 @@ scan_lbtree(
> >>>>
> >>>>  	ASSERT(dirty == 0 || (dirty && !no_modify));
> >>>>
> >>>> -	if ((dirty || badcrc) && !no_modify) {
> >>>> +	if (!err && (dirty || badcrc) && !no_modify) {
> >>>>  		libxfs_buf_mark_dirty(bp);
> >>>>  		libxfs_buf_relse(bp);
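(Editorial aside: for context, this is roughly how the end of scan_lbtree() reads once the change is applied. It is a sketch reconstructed from the hunk above, assuming the else branch implied by the brace structure in the diff; the comment is editorial, not verbatim repair/scan.c.)

	ASSERT(dirty == 0 || (dirty && !no_modify));

	/*
	 * Write the block back only if the scan succeeded.  When err is
	 * set, the caller is going to clear the whole inode, so dirtying
	 * a block with a bad CRC here would just hand corrupt data to the
	 * bmbt write verifier during the final flush.
	 */
	if (!err && (dirty || badcrc) && !no_modify) {
		libxfs_buf_mark_dirty(bp);
		libxfs_buf_relse(bp);
	} else
		libxfs_buf_relse(bp);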
> >>>
> >>> Hm.  So if scan_lbtree returns 1, that means that we clear the inode.
> >>> Hence there's no point in dirtying this buffer since we're going to zap
> >>> the whole inode anyway.
> >>>
> >>> This looks correct to me, so
> >>> Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
> >>>
> >>> But that said, you could refactor this part:
> >>>
> >>> 	if (!err && (dirty || badcrc) && !no_modify)
> >>> 		libxfs_buf_mark_dirty(bp);
> >>> 	libxfs_buf_relse(bp);
> >>>
> >>> More questions: Let's say that the btree-format fork has this btree:
> >>>
> >>>            fork
> >>>           /  |  \
> >>>          A   B   C
> >>>
> >>> Are there any cases where A is corrupt enough that the write verifier
> >>> will trip but scan_lbtree/scan_bmapbt return 0?
> >>>
> >> I'm sorry for replying so late.
> >
> > Don't worry about it, I'm just as slow. :)
> >
> >> This situation should not exist.
> >> scan_bmapbt() performs the following checks:
> >> 1. Check the magic of the block
> >> 2. Check the level of the block
> >> 3. Check that the block's owner matches the inode being scanned
> >> 4. If it is a V5 filesystem, check three more items: the UUID, the
> >>    block's consistency, and which inode the block belongs to
> >> 5. Calculate which AG the block belongs to and check the block's usage
> >> 6. If it is a leaf node, check whether the number of records exceeds
> >>    the maximum value
> >>
> >> xfs_bmbt_write_verify() performs the following checks:
> >> 1. Check the magic of the block
> >> 2. If it is a V5 filesystem, check three more items: the UUID, the
> >>    block's consistency, and which inode the block belongs to
> >> 3. Check that the level of the block is within the specified range
> >> 4. Check whether the number of records in the block exceeds the
> >>    maximum limit
> >> 5. Check that the block's left and right sibling pointers are within
> >>    the bounds of the filesystem
> >>
> >> As can be seen from the above, scan_bmapbt() checks more, and in more
> >> detail, than xfs_bmbt_write_verify().  Therefore, if scan_bmapbt()
> >> returns 0, xfs_bmbt_write_verify() will not report an error.
> >>
> >>> Or, let's say that we dirty A, then scan_bmapbt decides that B is total
> >>> garbage and returns 1.  Should we then mark A stale so that it doesn't
> >>> get written out unnecessarily?
> >>>
> >> We can allocate space in process_btinode() and pass it to
> >> scan_lbtree/scan_bmapbt.  If a bp is set as dirty, we record it.  If the
> >> inode needs to be cleaned up, we set all recorded bps as stale.  However,
> >> this does not affect the repair process.  Even if no record is kept, it
> >> only adds some unnecessary data writing.
> >>
> >> If there is nothing wrong with this, I will push a V2 patch.
> >
> > Hmm.  It's tempting to have scan_bmapbt put all the buffers it finds on
> > a list.  The corrected ones would be marked dirty, the good ones just
> > end up on the list.  If we decide to kill the bmbt we can then
> > invalidate all the buffers.  If we keep it, then we can write the dirty
> > blocks.
> >
> > Ugh.  But that gets gross -- if the bmbt is larger than memory, we can
> > end up OOMing xfs_repair.  Creating an interval bitmap of the fsblock
> > numbers of visited buffers might be less bad, but who knows.
> >
> > (Or I guess we could just apply this patch and see if anyone complains
> > about A being written after we decided to kill the bmbt due to B. ;))
> >
> 
> OK, I agree. Do I need to resend the patch or do something else?

Yeah, you might as well resend it with my RVB tag attached.

--D

> 
> > --D
> >
> >> Thanks
> >>
> >> Guanghao
> >>
> >>> Or, let's say that A is corrupt enough to trip the write verifier but
> >>> scan_lbtree/scan_bmapbt return 0; and B is corrupt enough that
> >>> scan_bmapbt returns 1.  In that case, we'd need to mark A stale so that
> >>> we clear the inode and repair can complete without tripping over A or B.
> >>> Does that actually happen?
> >>>
> >>> --D
> >>>
> >>>> 	}
> >>>> --
> >>>> 2.27.0
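(Editorial aside: for readers who want to see the buffer-tracking idea above in code form, here is a hypothetical sketch of the list-based variant Darrick describes: scan_bmapbt() would record the buffers it wants to dirty instead of dirtying them on the spot, and process_btinode() would later either queue them for writeback or invalidate them all if the inode is zapped. None of these names exist in xfs_repair; libxfs_buf_mark_stale in particular stands in for whatever invalidation helper would actually be used; and this is precisely the variant the thread notes could OOM when the bmbt is larger than memory.)

	#include <stdlib.h>
	#include <stdbool.h>
	#include "libxfs.h"	/* xfsprogs: struct xfs_buf, libxfs_buf_* */

	/* Hypothetical bookkeeping: buffers dirtied while walking one fork. */
	struct bmbt_buf_list {
		struct xfs_buf	**bps;
		unsigned int	nr;
		unsigned int	cap;
	};

	/* Called from scan_bmapbt() instead of marking bp dirty immediately. */
	static void
	bmbt_buf_record(struct bmbt_buf_list *bl, struct xfs_buf *bp)
	{
		if (bl->nr == bl->cap) {
			/* allocation error handling elided for brevity */
			bl->cap = bl->cap ? bl->cap * 2 : 16;
			bl->bps = realloc(bl->bps, bl->cap * sizeof(*bl->bps));
		}
		bl->bps[bl->nr++] = bp;	/* hold the reference; don't relse yet */
	}

	/*
	 * Called from process_btinode() once the fork scan finishes: if the
	 * inode is being cleared, invalidate the recorded buffers so none of
	 * them are ever written; otherwise queue them for writeback.
	 */
	static void
	bmbt_buf_finish(struct bmbt_buf_list *bl, bool zap_inode)
	{
		unsigned int	i;

		for (i = 0; i < bl->nr; i++) {
			if (zap_inode)
				libxfs_buf_mark_stale(bl->bps[i]);  /* hypothetical helper */
			else
				libxfs_buf_mark_dirty(bl->bps[i]);
			libxfs_buf_relse(bl->bps[i]);
		}
		free(bl->bps);
		bl->bps = NULL;
		bl->nr = bl->cap = 0;
	}

Holding every recorded buffer pinned like this is exactly why the thread worries about memory use on a huge bmbt; the interval-bitmap alternative would record only fsblock numbers and re-read the buffers once the keep-or-kill decision is made.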