On Tue, Dec 05, 2017 at 06:02:08PM -0800, Matthew Wilcox wrote: > On Wed, Dec 06, 2017 at 12:36:48PM +1100, Dave Chinner wrote: > > > - if (radix_tree_preload(GFP_NOFS)) > > > - return -ENOMEM; > > > - > > > INIT_LIST_HEAD(&elem->list_node); > > > elem->key = key; > > > > > > - spin_lock(&mru->lock); > > > - error = radix_tree_insert(&mru->store, key, elem); > > > - radix_tree_preload_end(); > > > - if (!error) > > > - _xfs_mru_cache_list_insert(mru, elem); > > > - spin_unlock(&mru->lock); > > > + do { > > > + xas_lock(&xas); > > > + xas_store(&xas, elem); > > > + error = xas_error(&xas); > > > + if (!error) > > > + _xfs_mru_cache_list_insert(mru, elem); > > > + xas_unlock(&xas); > > > + } while (xas_nomem(&xas, GFP_NOFS)); > > > > Ok, so why does this have a retry loop on ENOMEM despite the > > existing code handling that error? And why put such a loop in this > > code and not any of the other XFS code that used > > radix_tree_preload() and is arguably much more important to avoid > > ENOMEM on insert (e.g. the inode cache)? > > If we need more nodes in the tree, xas_store() will try to allocate them > with GFP_NOWAIT | __GFP_NOWARN. If that fails, it signals it in xas_error(). > xas_nomem() will notice that we're in an ENOMEM situation, and allocate > a node using your preferred GFP flags (NOIO in your case). Then we retry, > guaranteeing forward progress. [1] > > The other conversions use the normal API instead of the advanced API, so > all of this gets hidden away. For example, the inode cache does this: > > + curr = xa_cmpxchg(&pag->pag_ici_xa, agino, NULL, ip, GFP_NOFS); > > and xa_cmpxchg internally does: > > do { > xa_lock_irqsave(xa, flags); > curr = xas_create(&xas); > if (curr == old) > xas_store(&xas, entry); > xa_unlock_irqrestore(xa, flags); > } while (xas_nomem(&xas, gfp)); Ah, OK, that's not obvious from the code changes. :/ However, it's probably overkill for XFS. In all the cases, when we insert there should be no entry in the tree - the radix tree insert error handling code there was simply catching "should never happen" cases and handling it without crashing. Now that I've looked at this, I have to say that having a return value of NULL meaning "success" is quite counter-intuitive. That's going to fire my "that looks so wrong" detector every time I look at the code and notice it's erroring out on a non-null return value that isn't a PTR_ERR case.... Also, there's no need for irqsave/restore() locking contexts here as we never access these caches from interrupt contexts. That's just going to be extra overhead, especially on workloads that run 10^6 inodes inodes through the cache every second. That's a problem caused by driving the locks into the XA structure and then needing to support callers that require irq safety.... > > Also, I really don't like the pattern of using xa_lock()/xa_unlock() > > to protect access to an external structure. i.e. the mru->lock > > context is protecting multiple fields and operations in the MRU > > structure, not just the radix tree operations. Turning that around > > so that a larger XFS structure and algorithm is now protected by an > > opaque internal lock from generic storage structure the forms part > > of the larger structure seems like a bad design pattern to me... > > It's the design pattern I've always intended to use. Naturally, the > xfs radix trees weren't my initial target; it was the page cache, and > the page cache does the same thing; uses the tree_lock to protect both > the radix tree and several other fields in that same data structure. > > I'm open to argument on this though ... particularly if you have a better > design pattern in mind! I don't mind structures having internal locking - I have a problem with leaking them into contexts outside the structure they protect. That way lies madness - you can't change the internal locking in future because of external dependencies, and the moment you need something different externally we've got to go back to an external lock anyway. This is demonstrated by the way you converted the XFS dquot tree - you didn't replace the dquot tree lock with the internal xa_lock because it's a mutex and we have to sleep holding it. IOWs, we've added another layer of locking here, not simplified the code. What I really see here is that we have inconsistent locking patterns w.r.t. XA stores inside XFS - some have an external mutex to cover a wider scope, some use xa_lock/xa_unlock to span multiple operations, some directly access the internal xa lock via direct spin_lock/unlock(...xa_lock) calls and non-locking XA call variants. In some places you remove explicit rcu_read_lock() calls because the internal xa_lock implies RCU, but in other places we still need them because we have to protect the objects the tree points to, not just the tree.... IOWs, there's no consistent pattern to the changes you've made to the XFS code. The existing radix tree code has clear, consistent locking, tagging and lookup patterns. In contrast, each conversion to the XA code has resulted in a different solution for each radix tree conversion. Yes, there's been a small reduction in the amoutn of code in converting to the XA API, but it comes at the cost of consistency and ease of understanding the code that uses the radix tree API. Cheers, Dave. -- Dave Chinner david@xxxxxxxxxxxxx