On Tue, May 12, 2020 at 02:00:33PM -0700, Darrick J. Wong wrote: > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > > While QAing the new xfs_repair quotacheck code, I uncovered a quota > corruption bug resulting from a bad interaction between dquot buffer > initialization and quotacheck. The bug can be reproduced with the > following sequence: > > # mkfs.xfs -f /dev/sdf > # mount /dev/sdf /opt -o usrquota > # su nobody -s /bin/bash -c 'touch /opt/barf' > # sync > # xfs_quota -x -c 'report -ahi' /opt > User quota on /opt (/dev/sdf) > Inodes > User ID Used Soft Hard Warn/Grace > ---------- --------------------------------- > root 3 0 0 00 [------] > nobody 1 0 0 00 [------] > > # xfs_io -x -c 'shutdown' /opt > # umount /opt > # mount /dev/sdf /opt -o usrquota > # touch /opt/man2 > # xfs_quota -x -c 'report -ahi' /opt > User quota on /opt (/dev/sdf) > Inodes > User ID Used Soft Hard Warn/Grace > ---------- --------------------------------- > root 1 0 0 00 [------] > nobody 1 0 0 00 [------] > > # umount /opt > > Notice how the initial quotacheck set the root dquot icount to 3 > (rootino, rbmino, rsumino), but after shutdown -> remount -> recovery, > xfs_quota reports that the root dquot has only 1 icount. We haven't > deleted anything from the filesystem, which means that quota is now > under-counting. This behavior is not limited to icount or the root > dquot, but this is the shortest reproducer. > > I traced the cause of this discrepancy to the way that we handle ondisk > dquot updates during quotacheck vs. regular fs activity. Normally, when > we allocate a disk block for a dquot, we log the buffer as a regular > (dquot) buffer. Subsequent updates to the dquots backed by that block > are done via separate dquot log item updates, which means that they > depend on the logged buffer update being written to disk before the > dquot items. Because individual dquots have their own LSN fields, that > initial dquot buffer must always be recovered. > > However, the story changes for quotacheck, which can cause dquot block > allocations but persists the final dquot counter values via a delwri > list. Because recovery doesn't gate dquot buffer replay on an LSN, this > means that the initial dquot buffer can be replayed over the (newer) > contents that were delwritten at the end of quotacheck. In effect, this > re-initializes the dquot counters after they've been updated. If the > log does not contain any other dquot items to recover, the obsolete > dquot contents will not be corrected by log recovery. > > Because quotacheck uses a transaction to log the setting of the CHKD > flags in the superblock, we skip quotacheck during the second mount > call, which allows the incorrect icount to remain. > > Fix this by changing the ondisk dquot initialization function to use > ordered buffers to write out fresh dquot blocks if it detects that we're > running quotacheck. If the system goes down before quotacheck can > complete, the CHKD flags will not be set in the superblock and the next > mount will run quotacheck again, which can fix uninitialized dquot > buffers. This requires amending the defer code to maintaine ordered > buffer state across defer rolls for the sake of the dquot allocation > code. > > For regular operations we preserve the current behavior since the dquot > items require properly initialized ondisk dquot records. > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx> > --- > fs/xfs/libxfs/xfs_defer.c | 10 ++++++++- > fs/xfs/xfs_dquot.c | 51 ++++++++++++++++++++++++++++++++++----------- > 2 files changed, 47 insertions(+), 14 deletions(-) > ... > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c > index 7f39dd24d475..cf8b2f4de587 100644 > --- a/fs/xfs/xfs_dquot.c > +++ b/fs/xfs/xfs_dquot.c ... > @@ -277,11 +279,34 @@ xfs_qm_init_dquot_blk( > } > } > > - xfs_trans_dquot_buf(tp, bp, > - (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF : > - ((type & XFS_DQ_PROJ) ? XFS_BLF_PDQUOT_BUF : > - XFS_BLF_GDQUOT_BUF))); > - xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1); > + if (type & XFS_DQ_USER) { > + qflag = XFS_UQUOTA_CHKD; > + blftype = XFS_BLF_UDQUOT_BUF; > + } else if (type & XFS_DQ_PROJ) { > + qflag = XFS_PQUOTA_CHKD; > + blftype = XFS_BLF_PDQUOT_BUF; > + } else { > + qflag = XFS_GQUOTA_CHKD; > + blftype = XFS_BLF_GDQUOT_BUF; > + } > + > + xfs_trans_dquot_buf(tp, bp, blftype); > + > + /* > + * If the CHKD flag isn't set, we're running quotacheck and need to use > + * ordered buffers so that the logged initialization buffer does not > + * get replayed over the delwritten quotacheck buffer. If we crash > + * before the end of quotacheck, the CHKD flags will not be set in the > + * superblock and we'll re-run quotacheck at next mount. > + * > + * Outside of quotacheck, dquot updates are logged via dquot items and > + * we must use the regular buffer logging mechanisms to ensure that the > + * initial buffer state is recovered before dquot items. > + */ > + if (mp->m_qflags & qflag) > + xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1); > + else > + xfs_trans_ordered_buf(tp, bp); Ok, I think I follow what's going on here. quotacheck runs and allocates a dquot block and inits it in a transaction. Some time later quotacheck updates dquots in said block and queues the buffer in its own delwri list. Quotacheck completes, the buffer is written back and then the filesystem immediately crashes. We replay the content of the alloc/init transaction over the updated content in the block on disk. This isn't a problem outside of quotacheck because a subsequent dquot update would have also been logged instead of directly written back. Therefore, the purpose of the change above is primarily to avoid logging the init of the dquot buffer during quotacheck. That makes sense, but what about the integrity of this particular transaction? For example, what happens if this transaction commits to the on-disk log and we crash before quotacheck completes and the buffer is written back? I'm assuming recovery would replay the dquot allocation but not the initialization, then quotacheck would run again and find the buffer in an unknown state (perhaps failing a read verifier?). Hm? Brian > } > > /* >