On Tue, Jun 22, 2021 at 02:06:04PM +1000, Dave Chinner wrote: > From: Dave Chinner <dchinner@xxxxxxxxxx> > > The iclogbuf ring attached to the struct xlog is circular, hence the > first and last iclogs in the ring can only be determined by > comparing them against the log->l_iclog pointer. > > In xfs_cil_push_work(), we want to wait on previous iclogs that were > issued so that we can flush them to stable storage with the commit > record write, and it simply waits on the previous iclog in the ring. > This, however, leads to CIL push hangs in generic/019 like so: > > task:kworker/u33:0 state:D stack:12680 pid: 7 ppid: 2 flags:0x00004000 > Workqueue: xfs-cil/pmem1 xlog_cil_push_work > Call Trace: > __schedule+0x30b/0x9f0 > schedule+0x68/0xe0 > xlog_wait_on_iclog+0x121/0x190 > ? wake_up_q+0xa0/0xa0 > xlog_cil_push_work+0x994/0xa10 > ? _raw_spin_lock+0x15/0x20 > ? xfs_swap_extents+0x920/0x920 > process_one_work+0x1ab/0x390 > worker_thread+0x56/0x3d0 > ? rescuer_thread+0x3c0/0x3c0 > kthread+0x14d/0x170 > ? __kthread_bind_mask+0x70/0x70 > ret_from_fork+0x1f/0x30 > > With other threads blocking in either xlog_state_get_iclog_space() > waiting for iclog space or xlog_grant_head_wait() waiting for log > reservation space. > > The problem here is that the previous iclog on the ring might > actually be a future iclog. That is, if log->l_iclog points at > commit_iclog, commit_iclog is the first (oldest) iclog in the ring > and there are no previous iclogs pending as they have all completed > their IO and been activated again. IOWs, commit_iclog->ic_prev > points to an iclog that will be written in the future, not one that > has been written in the past. > > Hence, in this case, waiting on the ->ic_prev iclog is incorrect > behaviour, and depending on the state of the future iclog, we can > end up with a circular ABA wait cycle and we hang. > > The fix is made more complex by the fact that many iclogs states > cannot be used to determine if the iclog is a past or future iclog. > Hence we have to determine past iclogs by checking the LSN of the > iclog rather than their state. A past ACTIVE iclog will have a LSN > of zero, while a future ACTIVE iclog will have a LSN greater than > the current iclog. We don't wait on either of these cases. > > Similarly, a future iclog that hasn't completed IO will have an LSN > greater than the current iclog and so we don't wait on them. A past > iclog that is still undergoing IO completion will have a LSN less > than the current iclog and those are the only iclogs that we need to > wait on. > > Hence we can use the iclog LSN to determine what iclogs we need to > wait on here. > > Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO") > Reported-by: Brian Foster <bfoster@xxxxxxxxxx> > Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx> Ok, looks good to me. Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx> --D > --- > fs/xfs/xfs_log_cil.c | 35 ++++++++++++++++++++++++++++------- > 1 file changed, 28 insertions(+), 7 deletions(-) > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c > index 27bed1d9cf29..83a932878177 100644 > --- a/fs/xfs/xfs_log_cil.c > +++ b/fs/xfs/xfs_log_cil.c > @@ -877,7 +877,7 @@ xlog_cil_push_work( > * Once we attach the ctx to the iclog, a shutdown can process the > * iclog, run the callbacks and free the ctx. The only thing preventing > * this potential UAF situation here is that we are holding the > - * icloglock. Hence we cannot access the ctx after we have attached the > + * icloglock. Hence we cannot access the ctx once we have attached the > * callbacks and dropped the icloglock. > */ > spin_lock(&log->l_icloglock); > @@ -900,17 +900,38 @@ xlog_cil_push_work( > spin_unlock(&cil->xc_push_lock); > > /* > - * If the checkpoint spans multiple iclogs, wait for all previous > - * iclogs to complete before we submit the commit_iclog. In this case, > - * the commit_iclog write needs to issue a pre-flush so that the > - * ordering is correctly preserved down to stable storage. > + * If the checkpoint spans multiple iclogs, wait for all previous iclogs > + * to complete before we submit the commit_iclog. We can't use state > + * checks for this - ACTIVE can be either a past completed iclog or a > + * future iclog being filled, while WANT_SYNC through SYNC_DONE can be a > + * past or future iclog awaiting IO or ordered IO completion to be run. > + * In the latter case, if it's a future iclog and we wait on it, the we > + * will hang because it won't get processed through to ic_force_wait > + * wakeup until this commit_iclog is written to disk. Hence we use the > + * iclog header lsn and compare it to the commit lsn to determine if we > + * need to wait on iclogs or not. > * > * NOTE: It is not safe reference the ctx after this check as we drop > * the icloglock if we have to wait for completion of other iclogs. > */ > if (ctx->start_lsn != commit_lsn) { > - xlog_wait_on_iclog(commit_iclog->ic_prev); > - spin_lock(&log->l_icloglock); > + xfs_lsn_t plsn; > + > + plsn = be64_to_cpu(commit_iclog->ic_prev->ic_header.h_lsn); > + if (plsn && XFS_LSN_CMP(plsn, commit_lsn) < 0) { > + /* > + * Waiting on ic_force_wait orders the completion of > + * iclogs older than ic_prev. Hence we only need to wait > + * on the most recent older iclog here. > + */ > + xlog_wait_on_iclog(commit_iclog->ic_prev); > + spin_lock(&log->l_icloglock); > + } > + > + /* > + * We need to issue a pre-flush so that the ordering for this > + * checkpoint is correctly preserved down to stable storage. > + */ > commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH; > } > > -- > 2.31.1 >