On Tue, Aug 22, 2023 at 06:20:11PM -0700, Darrick J. Wong wrote: > On Wed, Aug 23, 2023 at 09:28:16AM +1000, Dave Chinner wrote: > > On Tue, Aug 22, 2023 at 11:30:16AM -0700, Darrick J. Wong wrote: > > > On Mon, Aug 21, 2023 at 03:17:52PM +1000, Dave Chinner wrote: > > > > On Fri, Aug 04, 2023 at 03:38:54PM -0700, Darrick J. Wong wrote: > > > > > From: Darrick J. Wong <djwong@xxxxxxxxxx> > > > > > > > > > > In commit 7c8ade2121200 ("xfs: implement percpu cil space used > > > > > calculation"), the XFS committed (log) item list code was converted to > > > > > use per-cpu lists and space tracking to reduce cpu contention when > > > > > multiple threads are modifying different parts of the filesystem and > > > > > hence end up contending on the log structures during transaction commit. > > > > > Each CPU tracks its own commit items and space usage, and these do not > > > > > have to be merged into the main CIL until either someone wants to push > > > > > the CIL items, or we run over a soft threshold and switch to slower (but > > > > > more accurate) accounting with atomics. > > > > > > > > > > Unfortunately, the for_each_cpu iteration suffers from the same race > > > > > with cpu dying problem that was identified in commit 8b57b11cca88f > > > > > ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from > > > > > cpu_online_mask before the CPUHP_XFS_DEAD callback gets called. As a > > > > > result, both CIL percpu structure aggregation functions fail to collect > > > > > the items and accounted space usage at the correct point in time. > > > > > > > > > > If we're lucky, the items that are collected from the online cpus exceed > > > > > the space given to those cpus, and the log immediately shuts down in > > > > > xlog_cil_insert_items due to the (apparent) log reservation overrun. > > > > > This happens periodically with generic/650, which exercises cpu hotplug > > > > > vs. the filesystem code. > > > > > > > > > > Applying the same sort of fix from 8b57b11cca88f to the CIL code seems > > > > > to make the generic/650 problem go away, but I've been told that tglx > > > > > was not happy when he saw: > > > > > > > > > > "...the only thing we actually need to care about is that > > > > > percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when > > > > > there are no CPUs dying, it has no addition overhead except for a > > > > > cpumask_or() operation." > > > > > > > > > > I have no idea what the /correct/ solution is, but I've been holding on > > > > > to this patch for 3 months. In that time, 8b57b11cca88 hasn't been > > > > > reverted and cpu_dying_mask hasn't been removed, so I'm sending this and > > > > > we'll see what happens. > > > > > > > > > > So, how /do/ we perform periodic aggregation of per-cpu data safely? > > > > > Move the xlog_cil_pcp_dead call to the dying section, where at least the > > > > > other cpus will all be stopped? And have it dump its items into any > > > > > online cpu's data structure? > > > > > > > > I suspect that we have to stop using for_each_*cpu() and hotplug > > > > notifiers altogether for this code. > > > > > > > > That is, we simply create our own bitmap for nr_possible_cpus in the > > > > CIL checkpoint context we allocate for each checkpoint (i.e. the > > > > struct xfs_cil_ctx). When we store something on that CPU for that > > > > CIL context, we set the corresponding bit for that CPU. Then when we > > > > are aggregating the checkpoint, we simply walk all the cpus with the > > > > "has items" bit set and grab everything. > > > > > > > > This gets rid of the need for hotplug notifiers completely - we just > > > > don't care if the CPU is online or offline when we sweep the CIL > > > > context for items at push time - if the bit is set then there's > > > > stuff to sweep... > > > > > > Oooh, good idea. Something like this, then? Lightly tested via > > > generic/650 for five minutes... > > > > Not quite what I was saying - you put the cpu mask in the struct > > xfs_cil, which means it can potentially be accessed from both commit > > and aggregation at the same time for different checkpoints. > > > > What I was suggesting is that it gets placed in the struct > > xfs_cil_ctx which is guaranteed to be private to > > private to...? Sorry, I thought I removed that chunk from my reply - ignore it. .... > > ..... > > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c > > > index 09638e8fb4ee..ef7775657ce3 100644 > > > --- a/fs/xfs/xfs_super.c > > > +++ b/fs/xfs/xfs_super.c > > > @@ -2313,7 +2313,6 @@ xfs_cpu_dead( > > > list_for_each_entry_safe(mp, n, &xfs_mount_list, m_mount_list) { > > > spin_unlock(&xfs_mount_list_lock); > > > xfs_inodegc_cpu_dead(mp, cpu); > > > - xlog_cil_pcp_dead(mp->m_log, cpu); > > > spin_lock(&xfs_mount_list_lock); > > > } > > > spin_unlock(&xfs_mount_list_lock); > > > > I wonder if we can do something similar for the inodegc code, and > > once again get rid of the need for hotpulg notifiers in XFS? > > I would imagine so, if you don't mind bloating up xfs_mount. The xfs_mount is already huge... :) -Dave. -- Dave Chinner david@xxxxxxxxxxxxx