On Mon, Jan 24, 2022 at 10:12:45AM -0500, Brian Foster wrote:
> On Fri, Jan 21, 2022 at 09:30:19PM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 21, 2022 at 01:33:46PM -0500, Brian Foster wrote:
> > > On Fri, Jan 21, 2022 at 09:26:03AM -0800, Darrick J. Wong wrote:
> > > > On Fri, Jan 21, 2022 at 09:24:54AM -0500, Brian Foster wrote:
> > > > > The XFS inode allocation algorithm aggressively reuses recently freed inodes. This is historical behavior that has been in place for quite some time, since XFS was imported to mainline Linux. Once the VFS adopted RCU-walk path lookups (also some time ago), this behavior became slightly incompatible because the inode recycle path doesn't isolate concurrent access to the inode from the VFS.
> > > > >
> > > > > This has recently manifested as problems in the VFS when XFS happens to change the type or properties of a recently unlinked inode while still involved in an RCU lookup. For example, if the VFS refers to a previous incarnation of a symlink inode, obtains the ->get_link() callback from inode_operations, and the latter happens to change to a non-symlink type via a recycle event, the ->get_link() callback pointer is reset to NULL and the lookup results in a crash.
> > > >
> > > > Hmm, so I guess what you're saying is that if the memory buffer allocation in ->get_link is slow enough, some other thread can free the inode, drop it, reallocate it, and reinstantiate it (not as a symlink this time) all before ->get_link's memory allocation call returns, after which Bad Things Happen(tm)?
> > > >
> > > > Can the lookup thread end up with the wrong inode->i_op too?
> > > >
> > > We really don't need to even get into the XFS symlink code to reason about the fundamental form of this issue. Consider that an RCU walk starts, locates a symlink inode, meanwhile XFS recycles that inode into something completely different, then the VFS loads and calls ->get_link() (which is now NULL) on said inode and explodes. So the presumption is that the VFS uses RCU protection to rely on some form of stability of the inode (i.e., that the inode memory isn't freed, callback vectors don't change, etc.).
> > >
> > > Validity of the symlink content is a variant of that class of problem, likely already addressed by the recent inline symlink change, but that doesn't address the broader issue.
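
To make the ordering concrete for anyone following along, my understanding of the race Brian describes is roughly the following (illustrative only, with much detail omitted):

	CPU 0 (RCU pathwalk)              CPU 1 (XFS)
	--------------------              -----------
	rcu_read_lock()
	finds symlink inode
	                                  inode unlinked and destroyed
	                                  inode recycled as a non-symlink
	                                    (->get_link() is now NULL)
	calls inode->i_op->get_link()
	  -> NULL pointer dereference
	rcu_read_unlock()

The pathwalk is relying on the grace period to guarantee that any inode it can still reach does not change identity out from under it.
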
> > > > > To avoid this class of problem, isolate in-core inodes for recycling with an RCU grace period. This is the same level of protection the VFS expects for inactivated inodes that are never reused, and so guarantees no further concurrent access before the type or properties of the inode change. We don't want an unconditional synchronize_rcu() event here because that would result in a significant performance impact to mixed inode allocation workloads.
> > > > >
> > > > > Fortunately, we can take advantage of the recently added deferred inactivation mechanism to mitigate the need for an RCU wait in most cases. Deferred inactivation queues and batches the on-disk freeing of recently destroyed inodes, and so significantly increases the likelihood that a grace period has elapsed by the time an inode is freed and observable by the allocation code as a reuse candidate.
> > > > >
> > > > > Capture the current RCU grace period cookie at inode destroy time and refer to it at allocation time to conditionally wait for an RCU grace period if one hadn't expired in the meantime. Since only unlinked inodes are recycle candidates and unlinked inodes always require inactivation,
> > > >
> > > > Any inode can become a recycle candidate (i.e. RECLAIMABLE but otherwise idle), but I think your point here is that unlinked inodes that become recycling candidates can cause lookup threads to trip over symlinks, and that's why we need to assign RCU state and poll on it, right?
> > >
> > > Good point. When I wrote the commit log I was thinking of recycled inodes as "reincarnated" inodes, so that wording could probably be improved. But yes, the code is written minimally/simply, so I was trying to document that it's unlinked -> freed -> reallocated inodes that we really care about here.
> > >
> > > WRT symlinks, I was trying to use that as an example and not necessarily as the general reason for the patch. I.e., the general reason is that the VFS uses rcu protection for inode stability (just as for the inode free path), and the symlink thing is just an example of how things can go wrong in the current implementation without it.
> > >
> > > > (That wasn't a challenge, I'm just making sure I understand this correctly.)
> > > >
> > > > > we only need to poll and assign RCU state in the inactivation codepath. Slightly adjust struct xfs_inode to fit the new field into padding holes that conveniently preexist in the same cacheline as the deferred inactivation list.
> > > > >
> > > > > Finally, note that the ideal long term solution here is to rearchitect bits of XFS' internal inode lifecycle management such that this additional stall point is not required, but this requires more thought, time and work to address. This approach restores functional correctness in the meantime.
> > > > >
> > > > > Signed-off-by: Brian Foster <bfoster@xxxxxxxxxx>
> > > > > ---
> > > > >
> > > > > Hi all,
> > > > >
> > > > > Here's the RCU fixup patch for inode reuse that I've been playing with, re: the vfs patch discussion [1]. I've put it in pretty much the most basic form, but I think there are a couple aspects worth thinking about:
> > > > >
> > > > > 1. Use and frequency of start_poll_synchronize_rcu() (vs. get_state_synchronize_rcu()). The former is a bit more active than the latter in that it triggers the start of a grace period, when necessary. This is currently invoked per inode, which is the ideal frequency in theory, but could be reduced, associated with the xfs_inodegc thresholds in some manner, etc., if there is good reason to do that.
> > > >
> > > > If you rm -rf $path, does each of the inodes get a separate rcu state, or do they share?
> > >
> > > My previous experiments on a teardown grace period had me thinking batching would occur, but I don't recall which RCU call I was using at the time, so I'd probably have to throw a tracepoint in there to dump some of the grace period values and double check to be sure. (If this is not the case, that might be a good reason to tweak things as discussed above.)
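
As an aside for anyone following along, the polled grace-period pattern being discussed here looks roughly like the sketch below. This is only an illustration with made-up names, not the actual XFS code:

	struct obj {
		unsigned long gp_cookie;
		/* ... */
	};

	/* Destroy time: snapshot the current grace period, starting a
	 * new one if none is in progress. */
	static void obj_destroy(struct obj *op)
	{
		op->gp_cookie = start_poll_synchronize_rcu();
		/* ... queue op for deferred processing ... */
	}

	/* Reuse time: wait only if that grace period has not already
	 * elapsed.  In the common case this returns immediately. */
	static void obj_reuse(struct obj *op)
	{
		cond_synchronize_rcu(op->gp_cookie);
		/* ... now safe to repurpose op ... */
	}

If you only want to check rather than wait, poll_state_synchronize_rcu() takes the same cookie and returns true once a full grace period has elapsed.
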
> > An RCU grace period typically takes some milliseconds to complete, so a great many inodes would end up being tagged for the same grace period. For example, if "rm -rf" could delete one file per microsecond, the first few thousand files would be tagged with one grace period, the next few thousand with the next grace period, and so on.
> >
> > In the unlikely event that RCU was totally idle when the "rm -rf" started, the very first file might get its own grace period, but they would batch in the thousands thereafter.
> >
> Great, thanks for the info.
>
> > On start_poll_synchronize_rcu() vs. get_state_synchronize_rcu(), if there is always other RCU update activity, get_state_synchronize_rcu() is just fine. So if XFS does a call_rcu() or synchronize_rcu() every so often, all you need here is get_state_synchronize_rcu().
> >
> > Another approach is to do a start_poll_synchronize_rcu() every 1,000 events, and use get_state_synchronize_rcu() otherwise. And there are a lot of possible variations on that theme.
> >
> > But why not just try always doing start_poll_synchronize_rcu() and only bother with get_state_synchronize_rcu() if that turns out to be too slow?
> >
> Ack, that makes sense to me. We use call_rcu() to free inode memory and obviously will have a sync in the lookup path after this patch, but that is a consequence of the polling we add at the same time. I'm not sure that's enough activity on its own, so I'd probably prefer to keep things simple, use the start_poll_*() variant from the start, and then consider further start/get filtering like you describe above if it ever becomes a problem.
>
> > > > > 2. The rcu cookie lifecycle. This variant updates it on inactivation queue and nowhere else because the RCU docs imply that counter rollover is not a significant problem. In practice, I think this means that if an inode is stamped at least once, and the counter rolls over, future (non-inactivation, non-unlinked) eviction -> repopulation cycles could trigger rcu syncs. I think this would require repeated eviction/reinstantiation cycles within a small window to be noticeable, so I'm not sure how likely this is to occur. We could be more defensive by resetting or refreshing the cookie. E.g., refresh (or reset to zero) at recycle time, unconditionally refresh at destroy time (using get_state_synchronize_rcu() for non-inactivation), etc.
> >
> > Even on a 32-bit system that is running RCU grace periods as fast as they will go, it will take about 12 days to overflow that counter. But if you have an inode sitting on the list for that long, yes, you could see unnecessary synchronous grace-period waits.
> >
> > Would it help if there was an API that gave you a special cookie value that cond_synchronize_rcu() and friends recognized as "already expired"? That way if poll_state_synchronize_rcu() says that original cookie has expired, you could replace that cookie value with one that would stay expired. Maybe a get_expired_synchronize_rcu() or some such?
> >
> Hmm.. so I think this would be helpful if we were to stamp the inode conditionally (i.e. unlinked inodes only) on eviction, because then we wouldn't have to worry about clearing the cookie if said inode happens to be reallocated and then run through one or more eviction -> recycle sequences after a rollover of the grace period counter. With that sort of scheme, the inode could be sitting in cache for who knows how long with a counter that was conditionally synced against many days (or weeks?) prior, from whenever it was initially reallocated.
>
> However, as Dave points out, we probably want to poll RCU state on every inode eviction, so I suspect that makes this less of an issue. An inode must be evicted for it to become a recycle candidate, and so if we update the inode unconditionally on every eviction, then I think the recycle code should always see the most recent cookie value and we don't have to worry much about clearing it.
>
> I think it's technically possible for an inode to sit in an inactivation queue for that sort of time period, but that would probably require the filesystem to go idle or drop to low enough activity that a spurious rcu sync here or there is not a big deal. So all in all, I suspect that if we already had such a special cookie variant of the API that was otherwise functionally equivalent, I'd probably use it to cover that potential case, but it's not clear to me atm that this use case necessarily warrants introduction of such an API...

If you need it, it happens to be easy to provide. If you don't need it, I am of course happy to avoid adding another RCU API member. ;-)

							Thanx, Paul

> > > > > Otherwise testing is ongoing, but this version at least survives an fstests regression run.
> > > > >
> > > > > Brian
> > > > >
> > > > > [1] https://lore.kernel.org/linux-fsdevel/164180589176.86426.501271559065590169.stgit@xxxxxxxxxxxxxxxxx/
> > > > >
> > > > >  fs/xfs/xfs_icache.c | 11 +++++++++++
> > > > >  fs/xfs/xfs_inode.h  |  3 ++-
> > > > >  2 files changed, 13 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> > > > > index d019c98eb839..4931daa45ca4 100644
> > > > > --- a/fs/xfs/xfs_icache.c
> > > > > +++ b/fs/xfs/xfs_icache.c
> > > > > @@ -349,6 +349,16 @@ xfs_iget_recycle(
> > > > >  	spin_unlock(&ip->i_flags_lock);
> > > > >  	rcu_read_unlock();
> > > > >  
> > > > > +	/*
> > > > > +	 * VFS RCU pathwalk lookups dictate the same lifecycle rules for an
> > > > > +	 * inode recycle as for freeing an inode. I.e., we cannot repurpose the
> > > > > +	 * inode until a grace period has elapsed from the time the previous
> > > > > +	 * version of the inode was destroyed. In most cases a grace period has
> > > > > +	 * already elapsed if the inode was (deferred) inactivated, but
> > > > > +	 * synchronize here as a last resort to guarantee correctness.
> > > > > +	 */
> > > > > +	cond_synchronize_rcu(ip->i_destroy_gp);
> > > > > +
> > > > >  	ASSERT(!rwsem_is_locked(&inode->i_rwsem));
> > > > >  	error = xfs_reinit_inode(mp, inode);
> > > > >  	if (error) {
> > > > > @@ -2019,6 +2029,7 @@ xfs_inodegc_queue(
> > > > >  	trace_xfs_inode_set_need_inactive(ip);
> > > > >  	spin_lock(&ip->i_flags_lock);
> > > > >  	ip->i_flags |= XFS_NEED_INACTIVE;
> > > > > +	ip->i_destroy_gp = start_poll_synchronize_rcu();
> > > > Hmm. The description says that we only need the rcu synchronization when we're freeing an inode after its link count drops to zero, because that's the vector for (say) the VFS inode ops actually changing due to free/inactivate/reallocate/recycle while someone else is doing a lookup.
> > > >
> > > Right..
> > >
> > > > I'm a bit puzzled why this unconditionally starts an rcu grace period, instead of doing so only if i_nlink==0; and why we call cond_synchronize_rcu above unconditionally instead of checking for i_mode==0 (or whatever state the cached inode is left in after it's freed)?
> > >
> > > Just an attempt to start simple and/or make any performance test/problems more blatant. I probably could have tagged this RFC. My primary goal with this patch was to establish whether the general approach is sane/viable/acceptable or we need to move in another direction.
> > >
> > > That aside, I think it's reasonable to have explicit logic around the unlinked case if we want to keep it restricted to that, though I would probably implement that as a conditional i_destroy_gp assignment and let the consumer context key off whether that field is set, rather than attempt to infer unlinked logic (and then I guess reset it back to zero so it doesn't leak across reincarnation). That also probably facilitates a meaningful tracepoint to track the cases that do end up syncing, which helps with your earlier question around batching, so I'll look into those changes once I get through broader testing.
> > >
> > > Brian
> > >
> > > > --D
> > > >
> > > > >  	spin_unlock(&ip->i_flags_lock);
> > > > >
> > > > >  	gc = get_cpu_ptr(mp->m_inodegc);
> > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > > > index c447bf04205a..2153e3edbb86 100644
> > > > > --- a/fs/xfs/xfs_inode.h
> > > > > +++ b/fs/xfs/xfs_inode.h
> > > > > @@ -40,8 +40,9 @@ typedef struct xfs_inode {
> > > > >  	/* Transaction and locking information. */
> > > > >  	struct xfs_inode_log_item *i_itemp;	/* logging information */
> > > > >  	mrlock_t		i_lock;		/* inode lock */
> > > > > -	atomic_t		i_pincount;	/* inode pin count */
> > > > >  	struct llist_node	i_gclist;	/* deferred inactivation list */
> > > > > +	unsigned long		i_destroy_gp;	/* destroy rcugp cookie */
> > > > > +	atomic_t		i_pincount;	/* inode pin count */
> > > > >
> > > > >  	/*
> > > > >  	 * Bitsets of inode metadata that have been checked and/or are sick.
> > > > > --
> > > > > 2.31.1
> > > > >
> > > > >
> > >
> >
>
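One more thought before I forget: the conditional variant Brian describes above might look something like the following. This is just an untested sketch to make the idea concrete, not a proposal for the actual code:

	/* In xfs_inodegc_queue(): stamp only inodes that can become
	 * recycle candidates, that is, unlinked inodes. */
	if (VFS_I(ip)->i_nlink == 0)
		ip->i_destroy_gp = start_poll_synchronize_rcu();

	/* In xfs_iget_recycle(): wait only if a cookie was assigned,
	 * then clear it so it cannot leak across reincarnation. */
	if (ip->i_destroy_gp) {
		cond_synchronize_rcu(ip->i_destroy_gp);
		ip->i_destroy_gp = 0;
	}

As noted above, keying off whether the field is set also gives you a natural tracepoint site for the cases that actually end up waiting.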