On Thu, Jul 22, 2021 at 12:30:18PM -0700, Darrick J. Wong wrote:
> On Thu, Jul 22, 2021 at 11:53:34AM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> >
> > After fixing the tail_lsn vs cache flush race, generic/482 continued
> > to fail in a similar way where cache flushes were missing before
> > iclog FUA writes. Tracing of iclog state changes during the fsstress
>
> Heh. ;)

....

> > +	 * xlog_cil_force_seq() call, but there are other writers still
> > +	 * accessing it so it hasn't been pushed to disk yet. Like the
> > +	 * ACTIVE case above, we need to make sure caches are flushed
> > +	 * when this iclog is written.
> > +	 */
> > +	iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
> > +	if (log_flushed)
> > +		*log_flushed = 1;
> > +	break;
> > +default:
> > +	/*
> > +	 * The entire checkpoint was written by the CIL force and is on
> > +	 * it's way to disk already. It will be stable when it
>
> s/it's/its/
>
> So now that we're at the end of this series, what are the rules for
> when we issue cache flushes and FUA writes?
>
> - Writing the unmount record always flushes the data and log devices.
>   Does it need to flush the rt device too?  I guess xfs_free_buftarg
>   does that.

Correct. RT device behaviour is unchanged as it only contains data,
and data is already guaranteed to be on stable storage before we
write the unmount record.

> - Start an async flush of the data device when doing CIL push work so
>   that anything the AIL wrote to disk (and pushed the tail) is
>   persisted before we assign a tail to the log record that we write
>   at the end?
>
> - If any other AIL work completes (and pushes the tail ahead) by the
>   time we actually write the log record, flush the data device a
>   second time?

Yes.

> - If a log checkpoint spans multiple iclogs, flush the *log* device
>   before writing the iclog with the commit record in it.

Yes. And for internal logs we have the natural optimisation that
these two cases are handled by the same cache flush, so for large
checkpoints on internal logs we don't see log tail update races.

> - Any time we write an iclog that commits a checkpoint, write that
>   record with FUA to ensure it's persisted.

*nod*

> - If we're forcing the log to disk as part of an integrity operation
>   (fsync, syncfs, etc.) then issue cache flushes for ... each? iclog
>   written to disk?  And use FUA for that write too?

This is where it gets messy, because log forces are not based around
checkpoint completions. Hence we have no idea what is actually in the
iclog we are flushing, and so must treat every iclog as if it might
contain a commit record, close off a multi-iclog checkpoint, or have
raced with a log tail update.

We don't know - and can't know from the iclog state - which of these
conditions exists, so we have to assume that at least one of them
holds for any ACTIVE or WANT_SYNC iclog we end up flushing or
waiting on.

If the iclog is already on its way to disk and it contains a commit
record, then the cache flush requirements for metadata/journal
ordering have already been met and we don't need to do anything other
than wait. But if we have to flush the iclog or wait for a flush by a
third party, we need to ensure that cache flushes occur so that the
log force semantics are upheld.
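To make that concrete, the conservative rule the force path ends up
with reduces to something like this (a condensed sketch of the shape
of the logic, not the exact code in the patch - refcounting, state
transitions and the wait are all elided):

	switch (iclog->ic_state) {
	case XLOG_STATE_ACTIVE:
	case XLOG_STATE_WANT_SYNC:
		/*
		 * Might hold a commit record, might close off a
		 * multi-iclog checkpoint, might have raced with a tail
		 * update - we can't tell from here, so order it fully.
		 */
		iclog->ic_flags |= XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA;
		if (log_flushed)
			*log_flushed = 1;
		break;
	default:
		/*
		 * Already submitted by the CIL push, which set up
		 * whatever flush/FUA state the checkpoint required.
		 * Nothing to do but wait for IO completion.
		 */
		break;
	}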
If the iclog doesn't contain a commit record (i.e. a log force in the
middle of a new, racing checkpoint write) we don't actually care
whether the iclog issues cache flushes or not, because a crash
immediately after the log force won't actually recover the checkpoint
contained in that iclog. From the log force perspective, the iclog
contains future changes, so we don't care whether it can be
recovered. But we don't know this, so we have to issue cache flushes
on every iclog we flush from the log force code.

This is why I mentioned that the log force code needs to be turned
inside out to guarantee that CIL checkpoints are flushed and stable,
rather than iclogs. We care about whole checkpoints being
recoverable, not whether some random iclog in the middle of a
checkpoint write is stable....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx