On Thu, Jan 13, 2022 at 08:37:00AM -0500, Brian Foster wrote:
> The xfs_inodegc_stop() helper performs a high-level flush of pending
> work on the percpu queues and then runs a cancel_work_sync() on each
> of the percpu work tasks to ensure all work has completed before
> returning. While cancel_work_sync() waits for wq tasks to complete,
> it does not guarantee work tasks have started. This means that the
> _stop() helper can queue and instantly cancel a wq task without
> having completed the associated work. This can be observed by
> tracepoint inspection of a simple "rm -f <file>; fsfreeze -f <mnt>"
> test:
>
> xfs_destroy_inode: ... ino 0x83 ...
> xfs_inode_set_need_inactive: ... ino 0x83 ...
> xfs_inodegc_stop: ...
> ...
> xfs_inodegc_start: ...
> xfs_inodegc_worker: ...
> xfs_inode_inactivating: ... ino 0x83 ...
>
> The first few lines show that the inode is removed and need inactive
> state set, but the inactivation work has not completed before the
> inodegc mechanism stops. The inactivation doesn't actually occur
> until the fs is unfrozen and the gc mechanism starts back up. Note
> that this test requires fsfreeze to reproduce because xfs_freeze
> indirectly invokes xfs_fs_statfs(), which calls xfs_inodegc_flush().
>
> When this occurs, the workqueue try_to_grab_pending() logic first
> tries to steal the pending bit, which does not succeed because the
> bit has been set by queue_work_on(). Subsequently, it checks for
> association of a pool workqueue from the work item under the pool
> lock. This association is set at the point a work item is queued and
> cleared when dequeued for processing. If the association exists, the
> work item is removed from the queue and cancel_work_sync() returns
> true. If the pwq association is cleared, the remove attempt assumes
> the task is busy and retries (eventually returning false to the
> caller after waiting for the work task to complete).
>
> To avoid this race, we can flush each work item explicitly before
> cancel. However, since the _queue_all() already schedules each
> underlying work item, the workqueue-level helpers are sufficient to
> achieve the same ordering effect. E.g., the inodegc enabled flag
> prevents scheduling any further work in the _stop() case. Use the
> drain_workqueue() helper in this particular case to make the intent
> a bit more self-explanatory.

/me wonders why he didn't think of drain/flush_workqueue in the first
place. Hm. Well, the logic looks sound so,

Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

and I'll go give this a spin.

--D

> Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> ---
>  fs/xfs/xfs_icache.c | 22 ++++------------------
>  1 file changed, 4 insertions(+), 18 deletions(-)
>
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index d019c98eb839..7a2a5e2be3cf 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -1852,28 +1852,20 @@ xfs_inodegc_worker(
>  }
>
>  /*
> - * Force all currently queued inode inactivation work to run immediately, and
> - * wait for the work to finish. Two pass - queue all the work first pass, wait
> - * for it in a second pass.
> + * Force all currently queued inode inactivation work to run immediately and
> + * wait for the work to finish.
>   */
>  void
>  xfs_inodegc_flush(
>  	struct xfs_mount	*mp)
>  {
> -	struct xfs_inodegc	*gc;
> -	int			cpu;
> -
>  	if (!xfs_is_inodegc_enabled(mp))
>  		return;
>
>  	trace_xfs_inodegc_flush(mp, __return_address);
>
>  	xfs_inodegc_queue_all(mp);
> -
> -	for_each_online_cpu(cpu) {
> -		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> -		flush_work(&gc->work);
> -	}
> +	flush_workqueue(mp->m_inodegc_wq);
>  }
>
>  /*
> @@ -1884,18 +1876,12 @@ void
>  xfs_inodegc_stop(
>  	struct xfs_mount	*mp)
>  {
> -	struct xfs_inodegc	*gc;
> -	int			cpu;
> -
>  	if (!xfs_clear_inodegc_enabled(mp))
>  		return;
>
>  	xfs_inodegc_queue_all(mp);
> +	drain_workqueue(mp->m_inodegc_wq);
>
> -	for_each_online_cpu(cpu) {
> -		gc = per_cpu_ptr(mp->m_inodegc, cpu);
> -		cancel_work_sync(&gc->work);
> -	}
>  	trace_xfs_inodegc_stop(mp, __return_address);
>  }
>
> --
> 2.31.1
>
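
For anyone who wants to see the cancel_work_sync() vs. drain_workqueue()
difference outside of XFS, here is a minimal sketch of a hypothetical
standalone test module. The demo_* names are made up for illustration
and are not part of the patch; only the workqueue calls themselves are
real kernel API.

/*
 * Hypothetical demo module: queue one work item, then stop the
 * workqueue. drain_workqueue() guarantees the already-queued item
 * runs; cancel_work_sync() would not.
 */
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_wq;
static struct work_struct demo_work;

static void demo_worker(struct work_struct *work)
{
	pr_info("demo_worker ran\n");
}

static int __init demo_init(void)
{
	demo_wq = alloc_workqueue("demo_wq", 0, 0);
	if (!demo_wq)
		return -ENOMEM;

	INIT_WORK(&demo_work, demo_worker);
	queue_work(demo_wq, &demo_work);

	/*
	 * cancel_work_sync(&demo_work) here could win the race described
	 * in the commit message: if the item is still pending (its pwq
	 * association intact), it is dequeued and demo_worker() never
	 * runs. drain_workqueue() instead waits until the queue is
	 * empty, so every item queued before this call is guaranteed
	 * to execute.
	 */
	drain_workqueue(demo_wq);
	return 0;
}

static void __exit demo_exit(void)
{
	destroy_workqueue(demo_wq);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

Loading this module should always print the demo_worker line. Swapping
the drain_workqueue() call for cancel_work_sync(&demo_work) can
silently cancel the still-pending item so the worker never runs, which
is exactly the behavior the patch removes from xfs_inodegc_stop().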