On Tue, 2021-12-14 at 21:06 -0800, Darrick J. Wong wrote:
> On Wed, Dec 15, 2021 at 11:54:11AM +0800, Ian Kent wrote:
> > On Wed, 2021-11-24 at 15:56 -0500, Brian Foster wrote:
> > > On Tue, Nov 23, 2021 at 10:26:57AM +1100, Dave Chinner wrote:
> > > > On Mon, Nov 22, 2021 at 02:27:59PM -0500, Brian Foster wrote:
> > > > > On Mon, Nov 22, 2021 at 11:08:51AM +1100, Dave Chinner wrote:
> > > > > > On Fri, Nov 19, 2021 at 02:44:21PM -0500, Brian Foster wrote:
> > > > > > > In any event, another experiment I ran in light of the
> > > > > > > above results that might be similar was to put the inode
> > > > > > > queueing component of destroy_inode() behind an rcu
> > > > > > > callback. This reduces the single threaded perf hit from
> > > > > > > the previous approach by about 50%. So not entirely
> > > > > > > baseline performance, but it's back closer to baseline if
> > > > > > > I double the throttle threshold (and actually faster at
> > > > > > > 4x). Perhaps my crude prototype logic could be optimized
> > > > > > > further to not rely on percpu threshold changes to match
> > > > > > > the baseline.
> > > > > > >
> > > > > > > My overall takeaway from these couple of hacky experiments
> > > > > > > is that the unconditional synchronous rcu wait is indeed
> > > > > > > probably too heavyweight, as you point out. The polling or
> > > > > > > callback (or perhaps your separate queue) approach seems
> > > > > > > to be in the ballpark of viability, however, particularly
> > > > > > > when we consider the behavior of scaled or mixed workloads
> > > > > > > (since inactive queue processing seems to be size driven
> > > > > > > vs. latency driven).
> > > > > > >
> > > > > > > So I dunno.. if you consider the various severity and
> > > > > > > complexity tradeoffs, this certainly seems worth more
> > > > > > > consideration to me. I can think of other potentially
> > > > > > > interesting ways to experiment with optimizing the above
> > > > > > > or perhaps tweak queueing to better facilitate taking
> > > > > > > advantage of grace periods, but it's not worth going too
> > > > > > > far down that road if you're wedded to the "busy inodes"
> > > > > > > approach.
> > > > > >
> > > > > > I'm not wedded to "busy inodes" but, as your experiments are
> > > > > > indicating, trying to handle rcu grace periods in the
> > > > > > deferred inactivation path isn't completely mitigating the
> > > > > > impact of having to wait for a grace period for recycling of
> > > > > > inodes.
> > > > >
> > > > > What I'm seeing so far is that the impact seems to be limited
> > > > > to the single threaded workload and is largely mitigated by an
> > > > > increase in the percpu throttle limit. IOW, it's not
> > > > > completely free right out of the gate, but the impact seems
> > > > > isolated and potentially mitigated by adjustment of the
> > > > > pipeline.
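
A rough sketch of the rcu-callback experiment described above might
look like the following. This is illustrative only: xfs_inodegc_queue()
is the real queueing entry point in the deferred inactivation code, but
the callback plumbing, the reuse of the VFS inode's i_rcu head, and the
stripped-down xfs_fs_destroy_inode() are assumptions about the
prototype, not the actual patch.

/*
 * Hypothetical sketch: defer the inode queueing component of
 * ->destroy_inode behind an RCU callback, so that a grace period has
 * already expired by the time the inode reaches the inactivation
 * queue and it can be recycled without further waiting.
 */
static void
xfs_inodegc_queue_rcu(
	struct rcu_head		*head)
{
	struct xfs_inode	*ip = container_of(head, struct xfs_inode,
						   i_vnode.i_rcu);

	xfs_inodegc_queue(ip);
}

STATIC void
xfs_fs_destroy_inode(
	struct inode		*inode)
{
	struct xfs_inode	*ip = XFS_I(inode);

	trace_xfs_destroy_inode(ip);

	/*
	 * Reusing the i_rcu head for this intermediate step is itself
	 * an assumption; the real teardown also uses RCU to free the
	 * inode once reclaim is done.
	 */
	call_rcu(&inode->i_rcu, xfs_inodegc_queue_rcu);
}
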
> > > > > I realize the throttle is a percpu value, so that is what has
> > > > > me wondering about some potential for gains in efficiency to
> > > > > try and get more of that single-threaded performance back in
> > > > > other ways, or perhaps enhancements that might be more broadly
> > > > > beneficial to deferred inactivation in general (i.e. some form
> > > > > of adaptive throttling thresholds to balance percpu thresholds
> > > > > against a global threshold).
> > > >
> > > > I ran experiments on queue depth early on. Once we go over a
> > > > few tens of inodes we start to lose the "hot cache" effect and
> > > > performance starts to go backwards. By queue depths of hundreds,
> > > > we've lost all the hot cache and nothing else gets that
> > > > performance back because we can't avoid the latency of all the
> > > > memory writes from cache eviction and the followup memory loads
> > > > that result.
> > >
> > > Admittedly my testing is simple/crude as I'm just exploring the
> > > potential viability of a concept, not fine tuning a workload,
> > > etc. That said, I'm curious to know what your tests for this look
> > > like because I suspect I'm running into different conditions. My
> > > tests frequently hit the percpu throttle threshold (256 inodes),
> > > which is beyond your ideal tens of inodes range (and probably
> > > more throttle limited than cpu cache limited).
> > >
> > > > Making the per-cpu queues longer or shorter based on global
> > > > state won't gain us anything. All it will do is slow down local
> > > > operations that don't otherwise need slowing down....
> > >
> > > This leaves out context. The increase in throttle threshold
> > > mitigates the delays I've introduced via the rcu callback. That
> > > happens to produce observable results comparable to my baseline
> > > test, but it's more a measure of the impact of the delay than a
> > > direct proposal. If there's a more fine grained test worth
> > > running here (re: above), please describe it.
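
For readers without the patch set handy, the percpu throttle being
discussed works roughly like the sketch below, loosely modeled on the
5.15-era deferred inactivation code in fs/xfs/xfs_icache.c. The 256
inode backlog figure matches the threshold mentioned above, but the
helper shown here and its simple flush_work() throttle are
simplifications, not the actual implementation.

/* One queue per cpu; inodes are pushed here from ->destroy_inode. */
struct xfs_inodegc {
	struct llist_head	list;	/* lockless inode queue */
	struct work_struct	work;	/* worker that inactivates them */
	unsigned int		items;	/* approximate queue depth */
};

#define XFS_INODEGC_MAX_BACKLOG	256	/* percpu throttle threshold */

static void
xfs_inodegc_queue(
	struct xfs_inode	*ip)
{
	struct xfs_mount	*mp = ip->i_mount;
	struct xfs_inodegc	*gc;
	bool			throttle;

	gc = get_cpu_ptr(mp->m_inodegc);
	llist_add(&ip->i_gclist, &gc->list);
	throttle = ++gc->items > XFS_INODEGC_MAX_BACKLOG;
	queue_work(mp->m_inodegc_wq, &gc->work);
	put_cpu_ptr(gc);

	/*
	 * Throttle the producer once the local backlog gets too deep;
	 * this is the knob being doubled/quadrupled in the experiments
	 * above.
	 */
	if (throttle)
		flush_work(&gc->work);
}
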
> > > > > > I suspect a rethink on the inode recycling mechanism is
> > > > > > needed. The way it is currently implemented was a brute
> > > > > > force solution - it is simple and effective. However, we
> > > > > > need more nuance in the recycling logic now. That is, if we
> > > > > > are recycling an inode that is clean, has nlink >= 1 and
> > > > > > has not been unlinked, it means the VFS evicted it too soon
> > > > > > and we are going to re-instantiate it as the identical
> > > > > > inode that was evicted from cache.
> > > > >
> > > > > Probably. How advantageous is inode memory reuse supposed to
> > > > > be in the first place? I've more considered it a necessary
> > > > > side effect of the broader architecture (i.e. deferred
> > > > > reclaim, etc.) as opposed to a primary optimization.
> > > >
> > > > Yes, it's an architectural feature resulting from the
> > > > filesystem inode life cycle being different to the VFS inode
> > > > life cycle. This was inherited from Irix - it had separate
> > > > inactivation vs reclaim states and action steps for vnodes -
> > > > inactivation occurred when the vnode refcount went to zero,
> > > > reclaim occurred when the vnode was to be freed.
> > > >
> > > > Architecturally, Linux doesn't have this two-step
> > > > infrastructure; it just has evict() that runs everything when
> > > > the inode needs to be reclaimed. Hence we hide the two-phase
> > > > reclaim architecture of XFS behind that, and so we have always
> > > > had this troublesome impedance mismatch on Linux.
> > >
> > > Ok, that was generally how I viewed it.
> > >
> > > > Thinking a bit about this, maybe there is a more integrated way
> > > > to handle this life cycle impedance mismatch by making the way
> > > > we interact with the linux inode cache more ... Irix like.
> > > > Linux does actually give us a callback when the last reference
> > > > to an inode goes away: .drop_inode()
> > > >
> > > > i.e. Maybe we should look to triggering inactivation work from
> > > > ->drop_inode instead of ->destroy_inode and hence always leave
> > > > unreferenced, reclaimable inodes in the VFS cache on the LRU.
> > > > i.e. rather than hiding the two-phase XFS inode
> > > > inactivation+reclaim algorithm from the VFS, move it up into
> > > > the VFS. If we prevent inodes from being reclaimed from the LRU
> > > > until they have finished inactivation and are clean (easy
> > > > enough just by marking the inode as dirty), that would allow us
> > > > to immediately reclaim and free inodes in evict() context.
> > > > Integration with __wait_on_freeing_inode() would likely solve
> > > > the RCU reuse/recycling issues.
> > >
> > > Hmm.. this is the point where we decide whether the inode remains
> > > cached, which is currently basically whether the inode has a link
> > > count or not. That makes me curious what happens (or can happen)
> > > with an unlinked/inactivated inode on the lru. I'm not sure any
> > > other fs' do anything like that currently..?
> > >
> > > > There's more to it than just this, but perhaps the longer term
> > > > direction should be closer integration with the Linux inode
> > > > cache life cycle rather than trying to solve all these problems
> > > > underneath the VFS cache whilst still trying to play by its
> > > > rules...
> > >
> > > Yeah. Caching logic details aside, I think that makes sense.
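
A minimal sketch of the ->drop_inode direction outlined above, assuming
the hypothetical xfs_inodegc_queue() entry point from earlier: the
super_operations hook and I_DONTCACHE are real VFS interfaces, but
everything else here, including how the inode is kept unreclaimable on
the LRU and the fact that ->drop_inode runs with i_lock held, is
glossed over or assumed.

/*
 * Hypothetical ->drop_inode: returning nonzero asks the VFS to evict
 * the inode now; returning zero keeps it cached on the LRU.
 */
static int
xfs_fs_drop_inode(
	struct inode		*inode)
{
	struct xfs_inode	*ip = XFS_I(inode);

	/*
	 * Unlinked inodes change identity when freed, so let the VFS
	 * evict them immediately, as today.
	 */
	if (inode->i_nlink == 0 || (inode->i_state & I_DONTCACHE))
		return 1;

	/*
	 * Otherwise start inactivation while the inode stays on the
	 * LRU; the inode must be held "dirty" until it is clean again
	 * so the LRU shrinker cannot free it mid-inactivation (the
	 * mechanism for that is not shown).
	 */
	xfs_inodegc_queue(ip);
	return 0;
}
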
> > > > > I still see reuse occur with deferred inactivation, we just
> > > > > end up cycling through the same set of inodes as they fall
> > > > > through the queue rather than recycling the same one over and
> > > > > over. I'm sort of wondering what the impact would be if we
> > > > > didn't do this at all (for the new allocation case at least).
> > > >
> > > > We end up with a larger pool of free inodes in the finobt. This
> > > > is basically what my "busy inode check" proposal is based on -
> > > > inodes that we can't allocate without recycling just remain on
> > > > the finobt for longer before they can be used. This would be
> > > > awful if we didn't have the finobt to efficiently locate free
> > > > inodes - the finobt record iteration makes it pretty low
> > > > overhead to scan inodes here.
> > >
> > > I get the idea. That last bit is what I'm skeptical about. The
> > > finobt is based on the premise that free inode lookup becomes a
> > > predictable tree lookup instead of the old searching algorithm on
> > > the inobt, which we still support and which can be awful in its
> > > own right under worst case conditions. I agree that this would be
> > > bad on the inobt (which raises the question of how we'd provide
> > > these recycling correctness guarantees on !finobt fs'). What I'm
> > > more concerned about is whether this could make finobt enabled
> > > fs' (transiently) just as poor as the old algo under certain
> > > workloads/conditions.
> > >
> > > I think there needs to be at least some high level description of
> > > the search algorithm before we can sufficiently reason about its
> > > behavior..
> > >
> > > > > > So how much re-initialisation do we actually need for that
> > > > > > inode? Almost everything in the inode is still valid; the
> > > > > > problems come from inode_init_always() resetting the entire
> > > > > > internal inode state and XFS then having to set it up
> > > > > > again. The internal state is already largely correct when
> > > > > > we start recycling, and the identity of the recycled inode
> > > > > > does not change when nlink >= 1. Hence eliding
> > > > > > inode_init_always() would also go a long way to avoiding
> > > > > > the need for an RCU grace period to pass before we can make
> > > > > > the inode visible to the VFS again.
> > > > > >
> > > > > > If we can do that, then the only inodes that need a grace
> > > > > > period before they can be recycled are unlinked inodes, as
> > > > > > they change identity when being recycled. That identity
> > > > > > change absolutely requires a grace period to expire before
> > > > > > the new instantiation can be made visible. Given the
> > > > > > arbitrary delay that this can introduce for an inode
> > > > > > allocation operation, it seems much better suited to
> > > > > > detecting busy inodes than waiting for a global OS state
> > > > > > change to occur...
> > > > >
> > > > > Maybe..? The experiments I've been doing are aimed at
> > > > > simplicity and reducing the scope of the changes. Part of the
> > > > > reason for this is tbh I'm not totally convinced we really
> > > > > need to do anything more complex than preserve the inline
> > > > > symlink buffer one way or another (for example, see the rfc
> > > > > patch below for an alternative to the inline symlink rcuwalk
> > > > > disable patch). Maybe we should consider something like this
> > > > > anyways.
> > > > >
> > > > > With busy inodes, we need to alter inode allocation to some
> > > > > degree to accommodate. We can have (tens of?) thousands of
> > > > > inodes under the grace period at any time based on current
> > > > > batching behavior, so it's not totally evident to me that we
> > > > > won't end up with some of the same fundamental issues to deal
> > > > > with here, just needing to be accommodated in the inode
> > > > > allocation algorithm rather than the teardown sequence.
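
In very rough outline, the nlink >= 1 recycling shortcut described a
few paragraphs up could look like the sketch below. inode_init_always()
and XFS_IRECLAIMABLE are real; the real recycle path does considerably
more (i_flags_lock, radix tree tag updates, error handling), so treat
this purely as an illustration of the identity-preserving case.

/*
 * Sketch: elide inode_init_always() when recycling an inode whose
 * identity is unchanged. Locking and radix tree bookkeeping omitted.
 */
static int
xfs_iget_recycle_sketch(
	struct xfs_inode	*ip)
{
	struct inode		*inode = VFS_I(ip);

	if (inode->i_nlink >= 1) {
		/*
		 * The VFS evicted this inode too soon and we are
		 * re-instantiating the identical inode. Nothing an
		 * rcu-walk lookup can observe gets reset, so no grace
		 * period is needed before making it visible again.
		 */
		ip->i_flags &= ~XFS_IRECLAIMABLE;
		return 0;
	}

	/*
	 * Unlinked inodes change identity on reuse: full
	 * re-initialisation, and a grace period must expire first
	 * (the "busy inode" case).
	 */
	return inode_init_always(ip->i_mount->m_super, inode);
}
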
> > > > Sure, but the purpose of the allocation selection policy is to
> > > > select the best inode to allocate for the current context. The
> > > > cost of not being able to use an inode immediately needs to be
> > > > factored into that allocation policy. i.e. if the selected
> > > > inode has an associated delay with it before it can be reused
> > > > and other free inodes don't, then we should not be selecting
> > > > the inode with a delay associated with it.
> > >
> > > We still have to find those "no delay" inodes. AFAICT the worst
> > > case conditions on the system I've been playing with can have
> > > something like 20k free && busy inodes. That could cover all or
> > > most of the finobt at the time of an inode allocation. What
> > > happens from there depends on the search algorithm.
> > >
> > > > This is exactly the reasoning and logic we use for busy
> > > > extents. We only take the blocking penalty for resolving busy
> > > > extent state if we run out of free extents to search before
> > > > we've found an allocation candidate. I think it makes sense for
> > > > inode allocation, too.
> > >
> > > Sure, the idea makes sense and it's worth looking into. But there
> > > are enough contextual differences that I wouldn't just assume the
> > > same logic translates over to the finobt without potential for
> > > performance impact. For example, extent allocation has some
> > > advantages with things like delalloc (physical block allocation
> > > occurs async from buffered write syscall time) and the fact that
> > > metadata allocs can reuse busy blocks. The finobt only tracks
> > > existing chunks with free inodes, so it's easily possible to have
> > > conditions where the finobt is 100% (or majority) populated with
> > > busy inodes (whether it be one inode or several thousand).
> > >
> > > This raises questions like: at what point does search cost become
> > > a factor? At what point and with what frequency do we suffer the
> > > blocking penalty? Do we opt to allocate new chunks based on gp
> > > state? Something else? We don't need to answer these questions
> > > here (this thread is long enough :P). I'm just trying to say that
> > > it's one thing to consider the approach a viable option, but it
> > > isn't automatically preferable just because we use it for
> > > extents. Further details beyond "detect busy inodes" would be
> > > nice to objectively reason about.
> > >
> > > > > --- 8< ---
> > > > >
> > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > index 64b9bf334806..058e3fc69ff7 100644
> > > > > --- a/fs/xfs/xfs_inode.c
> > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > @@ -2644,7 +2644,7 @@ xfs_ifree(
> > > > >  	 * already been freed by xfs_attr_inactive.
> > > > >  	 */
> > > > >  	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > > > -		kmem_free(ip->i_df.if_u1.if_data);
> > > > > +		kfree_rcu(ip->i_df.if_u1.if_data);
> > > > >  		ip->i_df.if_u1.if_data = NULL;
> > > >
> > > > That would need to be
> > > > rcu_assign_pointer(ip->i_df.if_u1.if_data, NULL) to put the
> > > > correct memory barriers in place, right? Also, I think
> > > > ip->i_df.if_u1.if_data needs to be set to NULL before calling
> > > > kfree_rcu() so racing lookups will always see NULL before the
> > > > object is freed.
> > >
> > > I think rcu_assign_pointer() is intended to be paired with the
> > > associated rcu deref and for scenarios like making sure an object
> > > isn't made available until it's completely initialized (i.e. such
> > > as for rcu protected list traversals, etc.).
> > >
> > > With regard to ordering, we no longer access if_data in rcuwalk
> > > mode with this change. Thus I think all we need here is the
> > > WRITE_ONCE(i_link, NULL) that pairs with the READ_ONCE() in the
> > > vfs, and that happens earlier in xfs_inactive_symlink() before we
> > > rcu free the memory here. With that, ISTM a racing lookup should
> > > either see an rcu protected i_link or NULL, the latter of which
> > > calls into ->get_link() and triggers refwalk mode. Hm?
> > >
> > > > But again, as I asked up front, why do we even need to free
> > > > this memory buffer here? It will be freed in
> > > > xfs_inode_free_callback() after the current RCU grace period
> > > > expires, so what do we gain by freeing it separately here?
> >
> > The thing that's been bugging me is not knowing if the VFS has
> > finished using the link string.
> >
> > The link itself can be removed while the link path could still be
> > valid and the VFS will then walk that path. So rcu grace time might
> > not be sufficient, but stashing the pointer and rcu freeing it
> > ensures the pointer won't go away while the walk is under way,
> > without any grace period guessing. Relating this to the current
> > path walk via an rcu delayed free is a reliable way to do what's
> > needed.
> >
> > The only problem here is keeping that path string around while it
> > is being used. If there aren't any other known problems with the
> > xfs inode re-use subsystem, I don't see any worth in complicating
> > it to cater for this special case.
> >
> > Brian's patch is a variation on the original patch and is all
> > that's really needed. IMHO going this way (whatever we end up with)
> > is the sensible thing to do.
>
> Why not simply change xfs_readlink to memcpy ip->i_df.if_u1.if_data
> into the caller's link buffer? We shouldn't be exposing internal XFS
> metadata buffers to the VFS to scribble on without telling us; this
> gets rid of the dual inode_operations for symlinks; and we're
> probably breaking a locking rule somewhere by not taking any locks
> AFAICT. That seems a lot less complex than adding rcu freeing rules
> for handling local format file forks.

Ahhhaa ... I didn't/don't understand what these inline symlinks are.

But it seems they aren't too different from the usual symlinks and
treating them as such makes the problem go away.

I'm pretty sure I was missing the point of some of this discussion too
because I just didn't get it.

Thanks for helping me get what's going on here Darrick.
Ian

> (I say this from the vantage point of online repair, which will try
> to salvage damaged symlinks, for which we actually /do/ want to be
> able to lock out readers and change the data fork after symlink
> creation... but I was saving that for 2022 because I'm too
> overwhelmed to try to send that again.)
>
> --D

> > Ian

> > > One prevented memory leak? ;)
> > >
> > > It won't be freed in xfs_inode_free_callback() because we change
> > > the data fork format type (and clear i_mode) in this path.
> > > Perhaps that could use an audit, but that's a separate issue.
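
For illustration, the xfs_readlink change suggested above would amount
to something like the sketch below: copy the inline target out of the
data fork under the inode lock instead of handing the VFS a pointer
into it. The lock helpers and fork fields are real; the function shape
is hypothetical and omits the remote (non-inline) symlink case that the
real xfs_readlink() also handles.

/*
 * Hypothetical inline-symlink half of a memcpy-based xfs_readlink().
 * Because the VFS gets its own copy, the fork buffer's lifetime no
 * longer matters to path walks and no RCU freeing rules are needed.
 */
STATIC int
xfs_readlink_inline_sketch(
	struct xfs_inode	*ip,
	char			*link,
	size_t			linklen)
{
	int			error = 0;

	xfs_ilock(ip, XFS_ILOCK_SHARED);

	if (ip->i_df.if_format != XFS_DINODE_FMT_LOCAL ||
	    XFS_IS_CORRUPT(ip->i_mount, !ip->i_df.if_u1.if_data))
		error = -EFSCORRUPTED;
	else
		strscpy(link, ip->i_df.if_u1.if_data, linklen);

	xfs_iunlock(ip, XFS_ILOCK_SHARED);
	return error;
}
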
> > > > >  		ip->i_df.if_bytes = 0;
> > > > >  	}
> > > > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> > > > > index a607d6aca5c4..e98d7f10ba7d 100644
> > > > > --- a/fs/xfs/xfs_iops.c
> > > > > +++ b/fs/xfs/xfs_iops.c
> > > > > @@ -511,27 +511,6 @@ xfs_vn_get_link(
> > > > >  	return ERR_PTR(error);
> > > > >  }
> > > > >  
> > > > > -STATIC const char *
> > > > > -xfs_vn_get_link_inline(
> > > > > -	struct dentry		*dentry,
> > > > > -	struct inode		*inode,
> > > > > -	struct delayed_call	*done)
> > > > > -{
> > > > > -	struct xfs_inode	*ip = XFS_I(inode);
> > > > > -	char			*link;
> > > > > -
> > > > > -	ASSERT(ip->i_df.if_format == XFS_DINODE_FMT_LOCAL);
> > > > > -
> > > > > -	/*
> > > > > -	 * The VFS crashes on a NULL pointer, so return
> > > > > -	 * -EFSCORRUPTED if if_data is junk.
> > > > > -	 */
> > > > > -	link = ip->i_df.if_u1.if_data;
> > > > > -	if (XFS_IS_CORRUPT(ip->i_mount, !link))
> > > > > -		return ERR_PTR(-EFSCORRUPTED);
> > > > > -	return link;
> > > > > -}
> > > > > -
> > > > >  static uint32_t
> > > > >  xfs_stat_blksize(
> > > > >  	struct xfs_inode	*ip)
> > > > > @@ -1250,14 +1229,6 @@ static const struct inode_operations xfs_symlink_inode_operations = {
> > > > >  	.update_time		= xfs_vn_update_time,
> > > > >  };
> > > > >  
> > > > > -static const struct inode_operations xfs_inline_symlink_inode_operations = {
> > > > > -	.get_link		= xfs_vn_get_link_inline,
> > > > > -	.getattr		= xfs_vn_getattr,
> > > > > -	.setattr		= xfs_vn_setattr,
> > > > > -	.listxattr		= xfs_vn_listxattr,
> > > > > -	.update_time		= xfs_vn_update_time,
> > > > > -};
> > > > > -
> > > > >  /* Figure out if this file actually supports DAX. */
> > > > >  static bool
> > > > >  xfs_inode_supports_dax(
> > > > > @@ -1409,9 +1380,8 @@ xfs_setup_iops(
> > > > >  		break;
> > > > >  	case S_IFLNK:
> > > > >  		if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL)
> > > > > -			inode->i_op = &xfs_inline_symlink_inode_operations;
> > > > > -		else
> > > > > -			inode->i_op = &xfs_symlink_inode_operations;
> > > > > +			inode->i_link = ip->i_df.if_u1.if_data;
> > > > > +		inode->i_op = &xfs_symlink_inode_operations;
> > > >
> > > > This still needs corruption checks - ip->i_df.if_u1.if_data can
> > > > be null if there's some kind of inode corruption detected.
> > >
> > > It's fine for i_link to be NULL. We'd just fall into the
> > > ->get_link() call and have to handle it there like the current
> > > callback does.
> > >
> > > However, this does need to restore some of the code removed from
> > > xfs_vn_get_link() in commit 30ee052e12b9 ("xfs: optimize inline
> > > symlinks") to handle the local format case. If if_data can be
> > > NULL we'll obviously need to handle it there anyways.
> > >
> > > If there's no fundamental objection I'll address these issues,
> > > give it some proper testing and send a real patch..
> > >
> > > Brian
> > >
> > > > >  		break;
> > > > >  	default:
> > > > >  		inode->i_op = &xfs_inode_operations;
> > > > > diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> > > > > index fc2c6a404647..20ec2f450c56 100644
> > > > > --- a/fs/xfs/xfs_symlink.c
> > > > > +++ b/fs/xfs/xfs_symlink.c
> > > > > @@ -497,6 +497,7 @@ xfs_inactive_symlink(
> > > > >  	 * do here in that case.
> > > > >  	 */
> > > > >  	if (ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) {
> > > > > +		WRITE_ONCE(VFS_I(ip)->i_link, NULL);
> > > >
> > > > Again, rcu_assign_pointer(), yes?
> > > >
> > > > >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > >  		return 0;
> > > > >  	}
> > > >
> > > > --
> > > > Dave Chinner
> > > > david@xxxxxxxxxxxxx