On Fri, Dec 06, 2019 at 08:02:11AM +1100, Dave Chinner wrote:
> On Thu, Dec 05, 2019 at 12:50:36PM -0500, Brian Foster wrote:
> > This is an AIL based mechanism to enable automatic relogging of
> > selected log items. The use case is for particular operations that
> > commit an item known to pin the tail of the log for a potentially
> > long period of time and otherwise cannot use a rolling transaction.
> > While this does not provide the deadlock avoidance guarantees of a
> > rolling transaction, it ties the relog transaction into AIL pushing
> > pressure such that we should expect the transaction to reserve the
> > necessary log space long before deadlock becomes a problem.
> > 
> > To enable relogging, a bit is set on the log item before it is first
> > committed to the log subsystem. Once the item commits to the on-disk
> > log and inserts to the AIL, AIL pushing dictates when the item is
> > ready for a relog. When that occurs, the item relogs in an
> > independent transaction to ensure the log tail keeps moving without
> > intervention from the original committer. To disable relogging, the
> > original committer clears the log item bit and optionally waits on
> > relogging activity to cease if it needs to reuse the item before the
> > operation completes.
> > 
> > While the current use case for automatic relogging is limited, the
> > mechanism is AIL based because it 1.) provides existing callbacks
> > into all possible log item types for future support and 2.) has
> > applicable context to determine when to relog particular items (such
> > as when an item pins the log tail). This provides enough flexibility
> > to support various log item types and future workloads without
> > introducing complexity up front for currently unknown use cases.
> > Further complexity, such as preallocated or regranted relog
> > transaction reservation or custom relog handlers can be considered
> > as the need arises.
> ....
> > +/*
> > + * Relog log items on the AIL relog queue.
> > + */
> > +static void
> > +xfs_ail_relog(
> > +	struct work_struct	*work)
> > +{
> > +	struct xfs_ail		*ailp = container_of(work, struct xfs_ail,
> > +							ail_relog_work);
> > +	struct xfs_mount	*mp = ailp->ail_mount;
> > +	struct xfs_trans	*tp;
> > +	struct xfs_log_item	*lip, *lipp;
> > +	int			error;
> 
> Probably need PF_MEMALLOC here - this will need to make progress in
> low memory situations, just like the xfsaild.
> 
> > +	/* XXX: define a ->tr_relog reservation */
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> > +	if (error)
> > +		return;
> 
> Also needs memalloc_nofs_save() around this whole function, because
> we most definitely don't want this recursing into the filesystem and
> running transactions when we are trying to ensure we don't pin the
> tail of the log...
> 

Ok. Note that this will likely change to not allocate the transaction
here, but I'll revisit this when appropriate.

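FWIW, the worker context I have in mind mirrors what xfsaild already
does for its kthread. Something like the following untested sketch,
with the actual relog processing elided:

	static void
	xfs_ail_relog(
		struct work_struct	*work)
	{
		struct xfs_ail		*ailp = container_of(work,
					struct xfs_ail, ail_relog_work);
		unsigned int		nofs_flags;
		unsigned int		noreclaim_flags;

		/*
		 * Relogging must make forward progress in low memory
		 * conditions and must never recurse into the
		 * filesystem, just like xfsaild, so run the entire
		 * worker under PF_MEMALLOC and NOFS allocation
		 * context.
		 */
		noreclaim_flags = memalloc_noreclaim_save();
		nofs_flags = memalloc_nofs_save();

		/* ... relog item processing ... */

		memalloc_nofs_restore(nofs_flags);
		memalloc_noreclaim_restore(noreclaim_flags);
	}
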
> > +
> > +	spin_lock(&ailp->ail_lock);
> > +	list_for_each_entry_safe(lip, lipp, &ailp->ail_relog_list, li_trans) {
> > +		list_del_init(&lip->li_trans);
> > +		xfs_trans_add_item(tp, lip);
> > +		set_bit(XFS_LI_DIRTY, &lip->li_flags);
> > +		tp->t_flags |= XFS_TRANS_DIRTY;
> > +	}
> > +	spin_unlock(&ailp->ail_lock);
> > +
> > +	error = xfs_trans_commit(tp);
> > +	ASSERT(!error);
> > +}
> 
> Simple, but I see several issues here regarding how generic this
> approach is.
> 
> 1. Memory reclaim deadlock. Needs to be run under
> memalloc_nofs_save() context (and possibly PF_MEMALLOC) because
> transaction allocation can trigger reclaim and that can run
> filesystem transactions and can block waiting for log space. If we
> already pin the tail of the log, then we deadlock in memory
> reclaim...
> 
> 2. Transaction reservation deadlock. If the item being relogged is
> already at the tail of the log, or if the item that pins the tail
> preventing further log reservations from finding space is in the
> relog list, we will effectively deadlock the filesystem here.
> 

Yes, this was by design. There are a couple of angles to this..

First, my testing to this point suggests that the current use case
doesn't require this level of guarantee. This deadlock can manifest in
a few seconds when combined with a parallel workload and a small log.
I've been stalling the quotaoff completion for minutes under those
same conditions to this point without reproducing any sort of issues.

That said, I think there are other issues with this approach. This
hasn't been tested thoroughly yet, for one. :) I'm not sure the
practical behavior would hold up for the scrub use case (more
relogged items) or for non-intent items (with lock contention).
Finally, I'm not sure it's totally appropriate to lock items for
relog before we acquire reservation for the relog transaction. All in
all, I'm still considering changes in this area. It's just a matter
of finding an approach that is simple enough and isn't overkill in
terms of overhead or complexity...

> 3. it's going to require a log reservation for every item on the
> relog list, in total. That cannot be known ahead of time as the
> reservation is dependent on the dirty size of the items being added
> to the transaction. Hence, if this is a generic mechanism, the
> required reservation could be quite large... If it's too large, see
> #1.
> 

Yeah, I punted on the transaction management for this RFC because 1.)
it's independent from the mechanics of how to relog items and so not
necessary to demonstrate that concept and 2.) I can think of at least
three or four different ways to manage transaction reservation to
address this problem, with varying levels of complexity/flexibility.
See the previous RFCs for examples of what I mean.

I've been thinking about doing something similar for this variant,
but a bit more simple:

- The original (i.e. quotaoff) transaction becomes a permanent
  transaction.

- Immediately after transaction allocation, explicitly charge the
  relogging subsystem with a certain amount of (fixed size)
  reservation and roll the transaction.

- The rolled transaction proceeds as the original would have, sets
  the relog bit on the associated item(s) and commits. The relogged
  items are thus added to the log subsystem with the relog system
  already charged. Note that the aforementioned relog reservation is
  undone in the event of error/cancel before the final transaction
  commits.

On the backend/relog side of things:

- Establish a max item count for the relog transaction (probably 1
  for now). When the max is hit, the relog worker instance rolls the
  transaction and repeats until it drains the current relog item list
  (similar to a truncate operation, for example).

- Keep a global count of the number of active relog items somewhere.
  Once this drops to zero, the relog transaction is free to be
  cancelled and the reservation returned to the grant heads.

The latter bit essentially means the first relog transaction is
responsible for charging the relog system and doing the little
transaction roll dance, while subsequent parallel users effectively
acquire a reference count on the existence of the permanent relog
transaction until there are zero active reloggable items in the
subsystem. The transaction itself probably starts out as a fixed
->tr_relog reservation defined to cover the maximum supported
reloggable transaction (i.e. quotaoff, for now).

ISTM this keeps things reasonably simple, doesn't introduce runtime
overhead unless relogging is active, and guarantees deadlock
avoidance. Thoughts on that?

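To put some rough pseudo-code behind that front end (purely a sketch:
xfs_relog_reserve() and xfs_relog_cancel() are made up names for the
charge/undo helpers described above, and tr_qm_quotaoff is assumed to
have been converted to a permanent reservation):

	/* allocate the (now permanent) quotaoff transaction as usual */
	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0,
				&tp);
	if (error)
		return error;

	/*
	 * Charge the relog subsystem with a fixed, worst case relog
	 * reservation and roll so the charge survives this
	 * transaction's final commit.
	 */
	error = xfs_relog_reserve(mp);
	if (error)
		goto out_cancel;
	error = xfs_trans_roll(&tp);
	if (error)
		goto out_relog_cancel;

	/*
	 * ... proceed as the original quotaoff transaction would have:
	 * log the quotaoff item, set the relog bit on it and commit.
	 * The relog charge is undone once the count of active relog
	 * items drains back to zero ...
	 */
	return xfs_trans_commit(tp);

out_relog_cancel:
	xfs_relog_cancel(mp);
out_cancel:
	xfs_trans_cancel(tp);
	return error;
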
> 4. xfs_trans_commit() requires objects being logged to be locked
> against concurrent modification. Locking in the AIL context is
> non-blocking, so the item needs to be locked against modification
> before it is added to the ail_relog_list, and won't get unlocked
> until it is committed again.
> 

Yep..

> 5. Locking random items in random order into the ail_relog_list
> opens us up to ABBA deadlocks with locking in ongoing modifications.
> 

This relies on locking behavior as already implemented by xfsaild.
The whole point is to tie into existing infrastructure rather than
duplicate it or introduce new access patterns. All this does is
change how we handle certain log items once they are locked. It
doesn't introduce any new locking patterns that don't already occur.

> 6. If the item is already dirty in a transaction (i.e. currently
> locked and joined to another transaction - being relogged) then
> xfs_trans_add_item() is either going to fire asserts because
> XFS_LI_DIRTY is already set or it's going to corrupt the item list
> of that already running transaction.
> 

As above, the item isn't processed unless it is locked or otherwise
exclusive. In a general sense, there are no new log item access
patterns here. Relogging follows the same log item access rules as
the rest of the code.

> Given this, I'm skeptical this can be made into a useful, reliable
> generic async relogging mechanism.
> 

Fair, but that's also not the current goal. The goal is to address
the current use cases of quotaoff and scrub with consideration for
similar, basic handling of arbitrary log items in the future. It's
fairly trivial to implement relogging of a buffer or inode on top of
this, for example (see the sketch below). If more complex use cases
arise, we can build in more complexity as needed or further
genericize the mechanism, but I see no need to build in complexity or
introduce major abstractions given the current couple of fairly
quirky use cases and the otherwise lack of further requirements. I
view the lack of infrastructure as an advantage, so I'm intentionally
trying to keep things simple.

IOW, worst case, the mechanism in this approach is easy to repurpose
because beyond the reservation management bits, it's just a couple of
log item states, a list and a workqueue job. The locking issues noted
above strike me as a misunderstanding one way or another (which is
easy to do given the undefined use case). Given that, and with the
transaction issues resolved, I don't see the remaining concerns as
blockers. If you have thoughts on some future use cases that
obviously conflict or might warrant further complexity/requirements
right now, I'd be interested to hear more about it. Otherwise I'm
just kind of guessing at things in that regard; it's hard to design a
system around unknown requirements...

(I could also look into wiring up an arbitrary non-intent log item as
a canary, I suppose. That would be a final RFC patch that serves no
real purpose and wouldn't be merged, but would hook up the mechanism
for testing purposes, help prove sanity and provide some actual code
to reason about.)

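For illustration, hooking a buffer up as that sort of canary would
amount to something like the following from a transaction that
already has the buffer locked and joined (again purely a sketch;
XFS_LI_RELOG stands in for the opt-in bit described in the commit log
above):

	/*
	 * Log the full range of the buffer and opt the buf item in to
	 * automatic relogging. From here the AIL relogs the buffer as
	 * needed until whoever completes the operation clears the bit
	 * and waits for relogging activity on the item to drain.
	 */
	xfs_trans_log_buf(tp, bp, 0, BBTOB(bp->b_length) - 1);
	set_bit(XFS_LI_RELOG, &bp->b_log_item->bli_item.li_flags);
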
> >  /*
> >   * The cursor keeps track of where our current traversal is up to by tracking
> >   * the next item in the list for us. However, for this to be safe, removing an
> > @@ -363,7 +395,7 @@ static long
> >  xfsaild_push(
> >  	struct xfs_ail		*ailp)
> >  {
> > -	xfs_mount_t		*mp = ailp->ail_mount;
> > +	struct xfs_mount	*mp = ailp->ail_mount;
> >  	struct xfs_ail_cursor	cur;
> >  	struct xfs_log_item	*lip;
> >  	xfs_lsn_t		lsn;
> > @@ -425,6 +457,13 @@ xfsaild_push(
> >  			ailp->ail_last_pushed_lsn = lsn;
> >  			break;
> >  
> > +		case XFS_ITEM_RELOG:
> > +			trace_xfs_ail_relog(lip);
> > +			ASSERT(list_empty(&lip->li_trans));
> > +			list_add_tail(&lip->li_trans, &ailp->ail_relog_list);
> > +			set_bit(XFS_LI_RELOGGED, &lip->li_flags);
> > +			break;
> > +
> >  		case XFS_ITEM_FLUSHING:
> >  			/*
> >  			 * The item or its backing buffer is already being
> > @@ -491,6 +530,9 @@ xfsaild_push(
> >  	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
> >  		ailp->ail_log_flush++;
> >  
> > +	if (!list_empty(&ailp->ail_relog_list))
> > +		queue_work(ailp->ail_relog_wq, &ailp->ail_relog_work);
> > +
> 
> Hmmm. Nothing here appears to guarantee forwards progress of the
> relogged items? We just queue the items and move on? What if we scan
> the ail and trip over the item again before it's been processed by
> the relog worker function?
> 

See the XFS_LI_RELOGGED bit (and specifically how it is used in the
next patch). This could be lifted up a layer if that is more clear.

> >  	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
> >  out_done:
> >  		/*
> > @@ -834,15 +876,24 @@ xfs_trans_ail_init(
> >  	spin_lock_init(&ailp->ail_lock);
> >  	INIT_LIST_HEAD(&ailp->ail_buf_list);
> >  	init_waitqueue_head(&ailp->ail_empty);
> > +	INIT_LIST_HEAD(&ailp->ail_relog_list);
> > +	INIT_WORK(&ailp->ail_relog_work, xfs_ail_relog);
> > +
> > +	ailp->ail_relog_wq = alloc_workqueue("xfs-relog/%s", WQ_FREEZABLE, 0,
> > +					     mp->m_super->s_id);
> 
> It'll need WQ_MEMRECLAIM so it has a rescuer thread (relogging has
> to work when we are low on memory), and possibly WQ_HIPRI so that it
> doesn't get stuck behind other workqueues that run transactions and
> run the log out of space before we try to reserve log space for the
> relogging....
> 

Ok, I had dropped those for now because it wasn't immediately clear
if they were needed. I'll take another look at this.

Thanks for the feedback...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@xxxxxxxxxxxxx
> 