On Thu, Jul 22, 2021 at 11:53:32AM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > The recent journal flush/FUA changes replaced the flushing of the > data device on every iclog write with an up-front async data device > cache flush. Unfortunately, the assumption of which this was based > on has been proven incorrect by the flush vs log tail update > ordering issue. As the fix for that issue uses the > XLOG_ICL_NEED_FLUSH flag to indicate that data device needs a cache > flush, we now need to (once again) ensure that an iclog write to > external logs that need a cache flush to be issued actually issue a > cache flush to the data device as well as the log device. > > Fixes: eef983ffeae7 ("xfs: journal IO cache flush reductions") > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> > --- > fs/xfs/xfs_log.c | 19 +++++++++++-------- > 1 file changed, 11 insertions(+), 8 deletions(-) > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c > index 96434cc4df6e..a3c4d48195d9 100644 > --- a/fs/xfs/xfs_log.c > +++ b/fs/xfs/xfs_log.c > @@ -827,13 +827,6 @@ xlog_write_unmount_record( > /* account for space used by record data */ > ticket->t_curr_res -= sizeof(ulf); > > - /* > - * For external log devices, we need to flush the data device cache > - * first to ensure all metadata writeback is on stable storage before we > - * stamp the tail LSN into the unmount record. > - */ > - if (log->l_targ != log->l_mp->m_ddev_targp) > - blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev); > return xlog_write(log, &vec, ticket, NULL, NULL, XLOG_UNMOUNT_TRANS); > } > > @@ -1796,10 +1789,20 @@ xlog_write_iclog( > * metadata writeback and causing priority inversions. > */ > iclog->ic_bio.bi_opf = REQ_OP_WRITE | REQ_META | REQ_SYNC | REQ_IDLE; > - if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH) > + if (iclog->ic_flags & XLOG_ICL_NEED_FLUSH) { > iclog->ic_bio.bi_opf |= REQ_PREFLUSH; > + /* > + * For external log devices, we also need to flush the data > + * device cache first to ensure all metadata writeback covered > + * by the LSN in this iclog is on stable storage. This is slow, > + * but it *must* complete before we issue the external log IO. I'm a little confused about what's going on here. We're about to write a log record to disk, with h_tail_lsn reflecting the tail of the log and h_lsn reflecting the current head of the log (i.e. this record). If the log tail has moved forward since the last log record was written and this fs has an external log, we need to flush the data device because the AIL could have written logged items back into the filesystem and we need to ensure those items have been persisted before we write to the log the fact that the tail moved forward. The AIL itself doesn't issue cache flushes (nor does it need to), so that's why we do that here. Why don't we need a flush like this if only FUA is set? Is it not possible to write a checkpoint that fits within a single iclog after the log tail has moved forward? --D > + */ > + if (log->l_targ != log->l_mp->m_ddev_targp) > + blkdev_issue_flush(log->l_mp->m_ddev_targp->bt_bdev); > + } > if (iclog->ic_flags & XLOG_ICL_NEED_FUA) > iclog->ic_bio.bi_opf |= REQ_FUA; > + > iclog->ic_flags &= ~(XLOG_ICL_NEED_FLUSH | XLOG_ICL_NEED_FUA); > > if (xlog_map_iclog_data(&iclog->ic_bio, iclog->ic_data, count)) { > -- > 2.31.1 >