Re: [PATCH] xfs: use ordered buffers to initialize dquot buffers during quotacheck

On Wed, May 13, 2020 at 12:12:23PM -0400, Brian Foster wrote:
> On Wed, May 13, 2020 at 08:26:28AM -0700, Darrick J. Wong wrote:
> > On Wed, May 13, 2020 at 09:14:15AM -0400, Brian Foster wrote:
> > > On Tue, May 12, 2020 at 02:00:33PM -0700, Darrick J. Wong wrote:
> > > > From: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > 
> > > > While QAing the new xfs_repair quotacheck code, I uncovered a quota
> > > > corruption bug resulting from a bad interaction between dquot buffer
> > > > initialization and quotacheck.  The bug can be reproduced with the
> > > > following sequence:
> > > > 
> > > > # mkfs.xfs -f /dev/sdf
> > > > # mount /dev/sdf /opt -o usrquota
> > > > # su nobody -s /bin/bash -c 'touch /opt/barf'
> > > > # sync
> > > > # xfs_quota -x -c 'report -ahi' /opt
> > > > User quota on /opt (/dev/sdf)
> > > >                         Inodes
> > > > User ID      Used   Soft   Hard Warn/Grace
> > > > ---------- ---------------------------------
> > > > root            3      0      0  00 [------]
> > > > nobody          1      0      0  00 [------]
> > > > 
> > > > # xfs_io -x -c 'shutdown' /opt
> > > > # umount /opt
> > > > # mount /dev/sdf /opt -o usrquota
> > > > # touch /opt/man2
> > > > # xfs_quota -x -c 'report -ahi' /opt
> > > > User quota on /opt (/dev/sdf)
> > > >                         Inodes
> > > > User ID      Used   Soft   Hard Warn/Grace
> > > > ---------- ---------------------------------
> > > > root            1      0      0  00 [------]
> > > > nobody          1      0      0  00 [------]
> > > > 
> > > > # umount /opt
> > > > 
> > > > Notice how the initial quotacheck set the root dquot icount to 3
> > > > (rootino, rbmino, rsumino), but after shutdown -> remount -> recovery,
> > > > xfs_quota reports that the root dquot has only 1 icount.  We haven't
> > > > deleted anything from the filesystem, which means that quota is now
> > > > under-counting.  This behavior is not limited to icount or the root
> > > > dquot, but this is the shortest reproducer.
> > > > 
> > > > I traced the cause of this discrepancy to the way that we handle ondisk
> > > > dquot updates during quotacheck vs. regular fs activity.  Normally, when
> > > > we allocate a disk block for a dquot, we log the buffer as a regular
> > > > (dquot) buffer.  Subsequent updates to the dquots backed by that block
> > > > are done via separate dquot log item updates, which means that they
> > > > depend on the logged buffer update being written to disk before the
> > > > dquot items.  Because individual dquots have their own LSN fields, that
> > > > initial dquot buffer must always be recovered.
> > > > 
> > > > However, the story changes for quotacheck, which can cause dquot block
> > > > allocations but persists the final dquot counter values via a delwri
> > > > list.  Because recovery doesn't gate dquot buffer replay on an LSN, this
> > > > means that the initial dquot buffer can be replayed over the (newer)
> > > > contents that were delwritten at the end of quotacheck.  In effect, this
> > > > re-initializes the dquot counters after they've been updated.  If the
> > > > log does not contain any other dquot items to recover, the obsolete
> > > > dquot contents will not be corrected by log recovery.
> > > > 
> > > > Because quotacheck uses a transaction to log the setting of the CHKD
> > > > flags in the superblock, we skip quotacheck during the second mount
> > > > call, which allows the incorrect icount to remain.
> > > > 
> > > > Fix this by changing the ondisk dquot initialization function to use
> > > > ordered buffers to write out fresh dquot blocks if it detects that we're
> > > > running quotacheck.  If the system goes down before quotacheck can
> > > > complete, the CHKD flags will not be set in the superblock and the next
> > > > mount will run quotacheck again, which can fix uninitialized dquot
> > > > buffers.  This requires amending the defer code to maintain ordered
> > > > buffer state across defer rolls for the sake of the dquot allocation
> > > > code.
> > > > 
> > > > For regular operations we preserve the current behavior since the dquot
> > > > items require properly initialized ondisk dquot records.
> > > > 
> > > > Signed-off-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_defer.c |   10 ++++++++-
> > > >  fs/xfs/xfs_dquot.c        |   51 ++++++++++++++++++++++++++++++++++-----------
> > > >  2 files changed, 47 insertions(+), 14 deletions(-)
> > > > 
> > > ...  
> > > > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > > > index 7f39dd24d475..cf8b2f4de587 100644
> > > > --- a/fs/xfs/xfs_dquot.c
> > > > +++ b/fs/xfs/xfs_dquot.c
> > > ...
> > > > @@ -277,11 +279,34 @@ xfs_qm_init_dquot_blk(
> > > >  		}
> > > >  	}
> > > >  
> > > > -	xfs_trans_dquot_buf(tp, bp,
> > > > -			    (type & XFS_DQ_USER ? XFS_BLF_UDQUOT_BUF :
> > > > -			    ((type & XFS_DQ_PROJ) ? XFS_BLF_PDQUOT_BUF :
> > > > -			     XFS_BLF_GDQUOT_BUF)));
> > > > -	xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> > > > +	if (type & XFS_DQ_USER) {
> > > > +		qflag = XFS_UQUOTA_CHKD;
> > > > +		blftype = XFS_BLF_UDQUOT_BUF;
> > > > +	} else if (type & XFS_DQ_PROJ) {
> > > > +		qflag = XFS_PQUOTA_CHKD;
> > > > +		blftype = XFS_BLF_PDQUOT_BUF;
> > > > +	} else {
> > > > +		qflag = XFS_GQUOTA_CHKD;
> > > > +		blftype = XFS_BLF_GDQUOT_BUF;
> > > > +	}
> > > > +
> > > > +	xfs_trans_dquot_buf(tp, bp, blftype);
> > > > +
> > > > +	/*
> > > > +	 * If the CHKD flag isn't set, we're running quotacheck and need to use
> > > > +	 * ordered buffers so that the logged initialization buffer does not
> > > > +	 * get replayed over the delwritten quotacheck buffer.  If we crash
> > > > +	 * before the end of quotacheck, the CHKD flags will not be set in the
> > > > +	 * superblock and we'll re-run quotacheck at next mount.
> > > > +	 *
> > > > +	 * Outside of quotacheck, dquot updates are logged via dquot items and
> > > > +	 * we must use the regular buffer logging mechanisms to ensure that the
> > > > +	 * initial buffer state is recovered before dquot items.
> > > > +	 */
> > > > +	if (mp->m_qflags & qflag)
> > > > +		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
> > > > +	else
> > > > +		xfs_trans_ordered_buf(tp, bp);
> > > 
> > > Ok, I think I follow what's going on here. quotacheck runs and allocates
> > > a dquot block and inits it in a transaction. Some time later quotacheck
> > > updates dquots in said block and queues the buffer in its own delwri
> > > list. Quotacheck completes, the buffer is written back and then the
> > > filesystem immediately crashes. We replay the content of the alloc/init
> > > transaction over the updated content in the block on disk. This isn't a
> > > problem outside of quotacheck because a subsequent dquot update would
> > > have also been logged instead of directly written back.
> > 
> > Correct.
> > 
> > > Therefore, the purpose of the change above is primarily to avoid logging
> > > the init of the dquot buffer during quotacheck. That makes sense, but
> > > what about the integrity of this particular transaction? For example,
> > > what happens if this transaction commits to the on-disk log and we crash
> > > before quotacheck completes and the buffer is written back? I'm assuming
> > > recovery would replay the dquot allocation but not the initialization,
> > > then quotacheck would run again and find the buffer in an unknown state
> > > (perhaps failing a read verifier?). Hm?
> > 
> > Yes, the read verifier can fail in quotacheck, but quotacheck reacts to
> > that by re-trying the buffer read with a NULL buffer ops (in
> > xfs_qm_reset_dqcounts_all).  Next, it calls xfs_qm_reset_dqcounts, which
> > calls xfs_dqblk_repair to reinitialize the dquot records.  After that,
> > xfs_qm_reset_dqcounts resets even more of the dquot state (counters,
> > timers, flags, warns).
> > 
> 
> Ok, I was thinking we might get back to this "read or allocate" path but
> it sounds like the earlier reset phase reads every chunk, does this
> verifier dance and reinits if necessary. It's somewhat unfortunate that
> we can create those conditions intentionally, but I suppose that's
> better than the current state in that we at least recover from it. We
> should probably document that somewhere to explain why ordering the
> buffer like this is safe in quotacheck context.

Yeah, I was trying to get everyone there in the comment above, but maybe
this ought to be more explicitly stated in the comment.  The current
version of that looks like:

	/*
	 * When quotacheck runs, we use delayed writes to update all the
	 * dquots on disk in an efficient manner instead of logging the
	 * individual dquot changes as they are made.
	 *
	 * Hence if we log the buffer that we allocate here, then crash
	 * post-quotacheck while the logged initialisation is still in
	 * the active region of the log, we can lose the information
	 * quotacheck wrote directly to the buffer. That is, log
	 * recovery will replay the dquot buffer initialisation over the
	 * top of whatever information quotacheck had written to the
	 * buffer.
	 *
	 * To avoid this problem, dquot allocation during quotacheck
	 * needs to avoid logging the initialised buffer, but we still
	 * need to have writeback of the buffer pin the tail of the log
	 * so that it is initialised on disk before we remove the
	 * allocation transaction from the active region of the log.
	 * Marking the buffer as ordered instead of logging it provides
	 * this behaviour.
	 *
	 * If we crash before quotacheck completes, a subsequent
	 * quotacheck run will re-allocate and re-initialize the dquot
	 * records as needed.
	 */
	if (!(mp->m_qflags & qflag))
		xfs_trans_ordered_buf(tp, bp);
	else
		xfs_trans_log_buf(tp, bp, 0, BBTOB(q->qi_dqchunklen) - 1);
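
To make the "safe to re-run quotacheck" part concrete, the reset pass
boils down to something like the following per-cluster step.  This is a
rough sketch paraphrased from the xfs_qm_reset_dqcounts_all loop, not
the literal upstream code; the enclosing loop and locals such as bno,
firstid, type, and buffer_list are assumed from that function, and
error handling is trimmed:

	/*
	 * Read each dquot cluster with the verifier attached.  If the
	 * verifier rejects it (e.g. a cluster that was allocated but
	 * never initialised because we crashed mid-quotacheck), re-read
	 * it without verifiers so we can repair it in place.
	 */
	error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp,
			XFS_FSB_TO_DADDR(mp, bno),
			mp->m_quotainfo->qi_dqchunklen, 0, &bp,
			&xfs_dquot_buf_ops);
	if (error == -EFSCORRUPTED)
		error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp,
				XFS_FSB_TO_DADDR(mp, bno),
				mp->m_quotainfo->qi_dqchunklen, 0, &bp,
				NULL);
	if (error)
		return error;

	/* Reattach the verifier so writeback recomputes the CRCs. */
	bp->b_ops = &xfs_dquot_buf_ops;

	/* Reinitialise the dquot records and zero counters/timers/warns. */
	xfs_qm_reset_dqcounts(mp, bp, firstid, type);
	xfs_buf_delwri_queue(bp, buffer_list);
	xfs_buf_relse(bp);

So even if the ordered init buffer never makes it to disk before a
crash, the next quotacheck rewrites every cluster it finds attached to
the quota inodes before counting anything.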


--D

> Brian
> 
> > In short, quotacheck will fix anything it doesn't like about the dquot
> > buffers that are attached to the quota inodes.
> > 
> > --D
> > 
> > > Brian
> > > 
> > > >  }
> > > >  
> > > >  /*
> > > > 
> > > 
> > 
> 


