On Sun, Oct 23, 2022 at 08:16:13AM +1100, Dave Chinner wrote:
> On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote:
> > On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> > > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > > > When lazysbcount is enabled, multiple threads stress test the xfs report
> > > > the following problems:
>
> We've had lazy sb counters for 15 years and just about every XFS
> filesystem in production uses them, so providing us with some idea
> of the scope of the problem and how to reproduce it would be greatly
> appreciated.
>
> What stress test are you running? What filesystem config does it
> manifest on (other than lazysbcount=1)? How long does the stress
> test run for, and where/why does log recovery get run in this stress
> test?
>
> > > > XFS (loop0): SB summary counter sanity check failed
> > > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify+0x13b/0x460, xfs_sb block 0x0
> > > > XFS (loop0): Unmount and run xfs_repair
> > > > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem.
> > > > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > > > XFS (loop0): log mount/recovery failed: error -117
> > > > XFS (loop0): log mount failed
> > > >
> > > > The cause of the problem is that during the log recovery process, incorrect
> > > > icount and ifree are recovered from the log and fail to pass the size check
> > >
> > > Are you saying that the log contained a transaction in which
> > > ifree > icount?
> >
> > Yes, this situation is possible. For example consider the following sequence:
> >
> >   CPU0                                CPU1
> >   xfs_log_sb                          xfs_trans_unreserve_and_mod_sb
> >   ----------                          ------------------------------
> >   percpu_counter_sum(&mp->m_icount)
> >                                       percpu_counter_add(&mp->m_icount, idelta)
> >                                       percpu_counter_add_batch(&mp->m_icount,
> >                                               idelta, XFS_ICOUNT_BATCH)
> >   percpu_counter_sum(&mp->m_ifree)
>
> What caused the xfs_log_sb() to be called? Very few things
> actually log the superblock this way at runtime - it's generally
> only logged directly like this when a feature bit changes during a
> transaction (rare) or at a synchronisation point when everything
> else is idle and there's no chance of a race like this occurring...
>
> I can see a couple of routes to this occurring via feature bit
> modification, but I don't see them being easy to hit or something
> that would exist for very long in the journal. Hence I'm wondering
> if there should be runtime protection for xfs_log_sb() to avoid
> these problems....

Maybe. Or perhaps we sample m_i{count,free} until they come up with a
value that will pass the verifier, and only then log the new values to
the primary super xfs_buf?

> > > > in xfs_validate_sb_write().
> > > >
> > > > With lazysbcount is enabled, There is no additional lock protection for
> > > > reading m_ifree and m_icount in xfs_log_sb(), if other threads modifies
> > > > the m_ifree between the read m_icount and the m_ifree, this will make the
> > > > m_ifree larger than m_icount and written to the log.
> > > > If we have an unclean
> > > > shutdown, this will be corrected by xfs_initialize_perag_data() rebuilding
> > > > the counters from the AGF block counts, and the correction is later than
> > > > log recovery. During log recovery, incorrect ifree/icount may be restored
> > > > from the log and written to the super block, since ifree and icount have
> > > > not been corrected at this time, the relationship between ifree and icount
> > > > cannot be checked in xfs_validate_sb_write().
> > > >
> > > > So, don't check the size between ifree and icount in xfs_validate_sb_write()
> > > > when lazysbcount is enabled.
> > >
> > > Um, doesn't that neuter a sanity check on all V5 filesystems?
> >
> > Yes, such modifications don't look like the best way, all sb write checks
> > will be affect. Maybe it can increase the judgment of clean mount and reduce
> > the scope of influence, but this requires setting the XFS_OPSTATE_CLEAN after
> > re-initialise incore superblock counters, like this:
>
> I'm not sure that silencing the warning and continuing log recovery
> with an invalid accounting state is the best way to proceed. We
> haven't yet recovered unlinked inodes at the time this warning
> fires. If ifree really is larger than icount, then we've got a
> real problem at this point in log recovery.
>
> Hence I suspect that we should be looking at fixing the runtime race
> condition that leads to the problem, not trying to work around
> inconsistency in the recovery code.

Agreed. Higher level code shouldn't be storing garbage ifree/icount
to the primary superblock xfs_buf, period.

--D

> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx