On Thu, 2010-09-09 at 01:12 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@xxxxxxxxxx>
>
> The buffer cache hash is starting to show typical hash scalability
> problems.  Large scale testing is showing the number of cached items
> growing far larger than the hash can efficiently handle.  Hence we
> need to move to a self-scaling cache indexing mechanism.
>
> I have selected rbtrees for indexing because they can have O(log n)
> search scalability, and insert and remove cost is not excessive,
> even on large trees.  Hence we should be able to cache large numbers
> of buffers without incurring the excessive cache miss search
> penalties that the hash is imposing on us.
>
> To ensure we still have parallel access to the cache, we need
> multiple trees.  Rather than hashing the buffers by disk address to
> select a tree, it seems more sensible to separate trees by typical
> access patterns.  Most operations use buffers from within a single AG
> at a time, so rather than searching lots of different lists,
> separate the buffer indexes out into per-AG rbtrees.  This means that
> searches during metadata operations have a much higher chance of
> hitting cache resident nodes, and that updates of the tree are less
> likely to disturb trees being accessed on other CPUs doing
> independent operations.

I found a bug in this, and have a bunch of other less important
comments for you to consider.  I haven't spent any time contemplating
your decision to use RB trees on AGs, but it seems quite reasonable.
> Signed-off-by: Dave Chinner <dchinner@xxxxxxxxxx>
> ---
>  fs/xfs/linux-2.6/xfs_buf.c   |  139 +++++++++++++++++++++--------------------
>  fs/xfs/linux-2.6/xfs_buf.h   |   14 ++--
>  fs/xfs/linux-2.6/xfs_super.c |    6 +-
>  fs/xfs/xfs_ag.h              |    4 +
>  fs/xfs/xfs_mount.c           |    4 +-
>  5 files changed, 87 insertions(+), 80 deletions(-)
>
> diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> index b2b5dea..facd37e 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.c
> +++ b/fs/xfs/linux-2.6/xfs_buf.c

. . .

> @@ -432,14 +434,36 @@ _xfs_buf_find(
>  	ASSERT(!(range_length < (1 << btp->bt_sshift)));
>  	ASSERT(!(range_base & (xfs_off_t)btp->bt_smask));
>
> -	hash = &btp->bt_hash[hash_long((unsigned long)ioff, btp->bt_hashshift)];
> -
> -	spin_lock(&hash->bh_lock);
> -
> -	list_for_each_entry_safe(bp, n, &hash->bh_list, b_hash_list) {
> -		ASSERT(btp == bp->b_target);
> -		if (bp->b_file_offset == range_base &&
> -		    bp->b_buffer_length == range_length) {
> +	/* get tree root */
> +	pag = xfs_perag_get(btp->bt_mp, xfs_daddr_to_agno(btp->bt_mp, ioff));
> +
> +	/* walk tree */
> +	spin_lock(&pag->pagbuf_lock);
> +	rbp = &pag->pagbuf_tree.rb_node;
> +	parent = NULL;

Could drop the above assignment and change the while statement to read:

	while ((parent = *rbp)) {

(I know that leads to a mildly controversial style.)

> +	bp = NULL;
> +	while (*rbp) {
> +		parent = *rbp;
> +		bp = rb_entry(parent, struct xfs_buf, b_rbnode);

Below here you might as well make use of the value of "parent" in
places where (*rbp) is used.  I realize you're just mimicking what's
in the rbtree.h file though...  If the result seems to read funny,
maybe you could rename "parent" to be "entry" or something.

Here's the BUG:

> +		if (bp->b_file_offset < range_base)
> +			rbp = &(*rbp)->rb_left;
> +		else if (bp->b_file_offset > range_base)
> +			rbp = &(*rbp)->rb_right;

Your comparisons here are reversed.
In other words, I believe it should be:

	if (range_base < bp->b_file_offset)
		rbp = &parent->rb_left;
	else if (range_base > bp->b_file_offset)
		rbp = &parent->rb_right;

Maybe it mostly works in the order you have it, but I'm pretty sure
it's wrong.  (The "continue searching to the right" below would be
wrong though.)

> +		else {
> +			/*
> +			 * found a block offset match. If the range doesn't
> +			 * match, the only way this is allowed is if the buffer
> +			 * in the cache is stale and the transaction that made
> +			 * it stale has not yet committed. i.e. we are
> +			 * reallocating a busy extent. Skip this buffer and
> +			 * continue searching to the right for an exact match.
> +			 */
> +			if (bp->b_buffer_length != range_length) {
> +				ASSERT(bp->b_flags & XBF_STALE);

This assertion is new.  I trust it's correct, but maybe it should go
in separately (first).

> +				rbp = &(*rbp)->rb_right;
> +				continue;
> +			}
>  			atomic_inc(&bp->b_hold);
>  			goto found;
>  		}

When I first read the next hunk I thought you were mistakenly not
dropping the perag reference.  Later I realized it was intentional --
that the buffer holds a reference on the perag until it (the buffer)
is released.  This is a minor point but I think it would be helpful
to see a comment explaining that.
> @@ -449,17 +473,20 @@ _xfs_buf_find(
>  	if (new_bp) {
>  		_xfs_buf_initialize(new_bp, btp, range_base,
>  					range_length, flags);
> -		new_bp->b_hash = hash;
> -		list_add(&new_bp->b_hash_list, &hash->bh_list);
> +		new_bp->b_pag = pag;
> +		rb_link_node(&new_bp->b_rbnode, parent, rbp);
> +		rb_insert_color(&new_bp->b_rbnode, &pag->pagbuf_tree);
> +		spin_unlock(&pag->pagbuf_lock);
>  	} else {
>  		XFS_STATS_INC(xb_miss_locked);
> +		spin_unlock(&pag->pagbuf_lock);
> +		xfs_perag_put(pag);
>  	}
> -
> -	spin_unlock(&hash->bh_lock);
>  	return new_bp;
>
>  found:
> -	spin_unlock(&hash->bh_lock);
> +	spin_unlock(&pag->pagbuf_lock);
> +	xfs_perag_put(pag);
>
>  	/* Attempt to get the semaphore without sleeping,
>  	 * if this does not work then we need to drop the
> @@ -810,27 +837,30 @@ void
>  xfs_buf_rele(
>  	xfs_buf_t	*bp)
>  {
> -	xfs_bufhash_t	*hash = bp->b_hash;
> +	struct xfs_perag	*pag = bp->b_pag;
>
>  	trace_xfs_buf_rele(bp, _RET_IP_);
>
> -	if (unlikely(!hash)) {
> +	if (!pag) {

I know this is not related to your change, but I'm curious whether
this can even happen, and if so, when?

>  		ASSERT(!bp->b_relse);
> +		ASSERT(RB_EMPTY_NODE(&bp->b_rbnode));
>  		if (atomic_dec_and_test(&bp->b_hold))
>  			xfs_buf_free(bp);
>  		return;
>  	}
>
> +	ASSERT(!RB_EMPTY_NODE(&bp->b_rbnode));
>  	ASSERT(atomic_read(&bp->b_hold) > 0);
> -	if (atomic_dec_and_lock(&bp->b_hold, &hash->bh_lock)) {
> +	if (atomic_dec_and_lock(&bp->b_hold, &pag->pagbuf_lock)) {
>  		if (bp->b_relse) {
>  			atomic_inc(&bp->b_hold);
> -			spin_unlock(&hash->bh_lock);
> +			spin_unlock(&pag->pagbuf_lock);
>  			(*(bp->b_relse)) (bp);

Drop the excess parentheses here (above) while you're at it.
>  		} else {
>  			ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q)));
> -			list_del_init(&bp->b_hash_list);
> -			spin_unlock(&hash->bh_lock);
> +			rb_erase(&bp->b_rbnode, &pag->pagbuf_tree);
> +			spin_unlock(&pag->pagbuf_lock);
> +			xfs_perag_put(pag);
>  			xfs_buf_free(bp);
>  		}
>  	}
> @@ -1433,57 +1463,25 @@ xfs_buf_iomove(
>   */
>  void
>  xfs_wait_buftarg(
> -	xfs_buftarg_t	*btp)
> +	struct xfs_buftarg	*btp)
>  {
> -	xfs_bufhash_t	*hash;
> -	uint		i;
> +	struct xfs_perag	*pag;
> +	uint			i;
>
> -	for (i = 0; i < (1 << btp->bt_hashshift); i++) {
> -		hash = &btp->bt_hash[i];
> -		spin_lock(&hash->bh_lock);
> -		while (!list_empty(&hash->bh_list)) {
> -			spin_unlock(&hash->bh_lock);
> +	for (i = 0; i < btp->bt_mp->m_sb.sb_agcount; i++) {
> +		pag = xfs_perag_get(btp->bt_mp, i);
> +		spin_lock(&pag->pagbuf_lock);
> +		while (rb_first(&pag->pagbuf_tree)) {
> +			spin_unlock(&pag->pagbuf_lock);
>  			delay(100);
> -			spin_lock(&hash->bh_lock);
> +			spin_lock(&pag->pagbuf_lock);
>  		}
> -		spin_unlock(&hash->bh_lock);
> +		spin_unlock(&pag->pagbuf_lock);
> +		xfs_perag_put(pag);
>  	}
>  }

This suggestion is again not related to your change, but...  Would
this basic structure (not tested) be better than the above?

	int more;

	do {
		more = 0;
		for (i = 0; i < btp->bt_mp->m_sb.sb_agcount; i++) {
			pag = xfs_perag_get(btp->bt_mp, i);
			spin_lock(&pag->pagbuf_lock);
			if (rb_first(&pag->pagbuf_tree))
				more++;
			spin_unlock(&pag->pagbuf_lock);
			xfs_perag_put(pag);
		}
		if (more)
			delay(100);
	} while (more);

>
>  /*
> - * Allocate buffer hash table for a given target.
> - * For devices containing metadata (i.e. not the log/realtime devices)
> - * we need to allocate a much larger hash table.

. . .

> @@ -1645,6 +1643,12 @@ xfs_alloc_buftarg(
>
>  	btp = kmem_zalloc(sizeof(*btp), KM_SLEEP);
>
> +	/*
> +	 * The buftarg cache should never be used by external devices.

I suggest that "buftarg cache" is not a good name for the (now
per-AG) buffer cache.

> +	 * Ensure we catch any users with extreme prejudice.
> +	 */
> +	btp->bt_mp = external ?
NULL : mp;
> +
>  	btp->bt_dev =  bdev->bd_dev;
>  	btp->bt_bdev = bdev;
>  	if (xfs_setsize_buftarg_early(btp, bdev))
> @@ -1653,7 +1657,6 @@ xfs_alloc_buftarg(
>  		goto error;
>  	if (xfs_alloc_delwrite_queue(btp, fsname))
>  		goto error;
> -	xfs_alloc_bufhash(btp, external);
>  	return btp;
>
>  error:
> diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
> index 802dc5e..3797ee8 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.h
> +++ b/fs/xfs/linux-2.6/xfs_buf.h

. . .

> @@ -171,8 +170,8 @@ typedef struct xfs_buf {
>  	wait_queue_head_t	b_waiters;	/* unpin waiters */
>  	struct list_head	b_list;
>  	xfs_buf_flags_t		b_flags;	/* status flags */
> -	struct list_head	b_hash_list;	/* hash table list */
> -	xfs_bufhash_t		*b_hash;	/* hash table list start */
> +	struct rb_node		b_rbnode;	/* rbtree node */
> +	struct xfs_perag	*b_pag;		/* rbtree root */

Rather, "contains rbtree root" (in the comment).

>  	xfs_buftarg_t		*b_target;	/* buffer target (device) */
>  	atomic_t		b_hold;		/* reference count */
>  	xfs_daddr_t		b_bn;		/* block number for I/O */

. . .

> diff --git a/fs/xfs/xfs_ag.h b/fs/xfs/xfs_ag.h
> index 4917d4e..e01d4cf 100644
> --- a/fs/xfs/xfs_ag.h
> +++ b/fs/xfs/xfs_ag.h
> @@ -230,6 +230,10 @@ typedef struct xfs_perag {
>  	rwlock_t	pag_ici_lock;	/* incore inode lock */
>  	struct radix_tree_root pag_ici_root;	/* incore inode cache root */
>  	int		pag_ici_reclaimable;	/* reclaimable inodes */
> +
> +	/* buffer cache index */
> +	spinlock_t	pagbuf_lock;	/* lock for pagbuf_tree */
> +	struct rb_root	pagbuf_tree;	/* ordered tree of active buffers */

I know it's in some ways consistent with the rest of the naming
scheme for fields in this structure, but maybe these could be named
pag_buf_lock and pag_buf_tree.  (The rest of the fields here have a
sort of strange convention, and maybe they've got strong enough
history that they should be left alone.)

>  #endif
>  	int	pagb_count;	/* pagb slots in use */
>  } xfs_perag_t;

. . .
_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs