On Tue, Jun 04, 2013 at 10:21:32AM -0400, Chris Mason wrote:
> Quoting Dave Chinner (2013-06-04 07:54:56)
> > On Mon, Jun 03, 2013 at 08:46:01AM -0400, Mathieu Desnoyers wrote:
> > > * Chris Mason (clmason@xxxxxxxxxxxx) wrote:
> > > > Quoting Mathieu Desnoyers (2013-06-03 01:27:58)
> > > > > Hi Chris,
> > > > >
> > > > > I stumbled on the LWN article "A kernel skiplist implementation
> > > > > (Part 1)", and really thought I should tell you about the
> > > > > side-project I am currently working on: an RCU Judy array, with
> > > > > locking distributed within the data structure. I developed my own
> > > > > design of this structure specifically to be usable with RCU.
> > > > > (ref. my LPC2012 presentation:
> > > > > https://www.efficios.com/lpc2012-scaling-rcu-judy-arrays-cache-efficient-compact-fast-and-scalable-trie)
> > > > >
> > > > > I think it would fit your extents use-case perfectly, with
> > > > > excellent cache usage, constant lookup time, near-linear
> > > > > scalability for lookups, and pretty good update scalability.
> > > > >
> > > > > The code is available in a development branch of the Userspace
> > > > > RCU library:
> > > > >
> > > > > git://git.lttng.org/userspace-rcu.git branch: urcu/rcuja
> > > > > https://git.lttng.org/?p=userspace-rcu.git;a=shortlog;h=refs/heads/urcu/rcuja
> > > > >
> > > > > It runs entirely in user-space, although it can be ported easily
> > > > > to the Linux kernel. I'd be curious to see how it behaves with
> > > > > your workload.
> > > > >
> > > > > Relevant files:
> > > > > - rcuja/design.txt: design document of the data structure,
> > > > > - urcu/rcuja.h: public API,
> > > > > - rcuja/*.[ch]: implementation,
> > > > > - tests/test_urcu_ja.c: stress tests, example usage.
> > > > >
> > > > > After reading the article on skiplists, I added a new API
> > > > > specifically to handle non-overlapping ranges:
> > > > >
> > > > > struct cds_hlist_head cds_ja_lookup_lower_equal(struct cds_ja *ja,
> > > > >         uint64_t key);
> > > > >
> > > > > AFAIU, "extents", as far as keys are concerned, are nothing more
> > > > > than segments, with start and end values.
> > > > >
> > > > > Extents can be indexed by the Judy array in the following way: we
> > > > > use the start of segment as node key, and keep the end of segment
> > > > > value within the leaf node. We add those nodes into the Judy
> > > > > array, indexed by start-of-segment value. Then, when we want to
> > > > > figure out whether a value matches a segment, we do the following:
> > > > >
> > > > > 1) look up the possible segment match with
> > > > >    cds_ja_lookup_lower_equal(), passing the key to look up as
> > > > >    parameter,
> > > > >
> > > > > 2) check that our key fits within the range of the segment using
> > > > >    the end-of-segment value within the leaf node.
> > > > >
> > > > > Of course, this approach is limited to non-overlapping segments,
> > > > > so I hope this is what you need.
> > > >
> > > > Hi Mathieu,
> > > >
> > > > One problem here is that XFS wants to allow duplicate keys in the
> > > > tree. This is possible with some modifications to the skiplist
> > > > code, but I'm not sure if it fits into your description above.
> > >
> > > Are those segments that completely overlap, or partially overlap?
> >
> > Partial overlap. Can be wholly within an existing extent, or overlap
> > start, end or both.
>
> Ouch, ok. In private email yesterday I talked with Mathieu about how
> his current setup can't prevent the concurrent insertion of overlapping
> extents. He does have a plan to address this where the insertion is
> synchronized by keeping placeholders in the tree for the free space. I
> think it'll work, but I'm worried about doubling the cost of the
> insert.
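For concreteness, the lookup scheme Mathieu describes above reduces to a
lower-or-equal lookup on the start key followed by a range check against
the stored end. Here is a minimal standalone sketch, assuming an
inclusive end value, with a plain sorted array and a find_lower_equal()
helper as made-up stand-ins for the Judy array and
cds_ja_lookup_lower_equal(); only the two-step lookup logic comes from
the description quoted above:

/*
 * Sketch of the "start key + end value" lookup scheme for
 * non-overlapping segments.  Illustration only, not the cds_ja API.
 */
#include <stdint.h>
#include <stdio.h>

struct segment {
        uint64_t start; /* index key */
        uint64_t end;   /* kept in the leaf node (inclusive here) */
};

/* Non-overlapping segments, sorted by start (what the index maintains). */
static struct segment segments[] = {
        { .start = 0,   .end = 99  },
        { .start = 200, .end = 299 },
        { .start = 500, .end = 999 },
};

/* Stand-in for cds_ja_lookup_lower_equal(): greatest start <= key. */
static struct segment *find_lower_equal(uint64_t key)
{
        struct segment *match = NULL;
        size_t i;

        for (i = 0; i < sizeof(segments) / sizeof(segments[0]); i++) {
                if (segments[i].start <= key)
                        match = &segments[i];
                else
                        break;
        }
        return match;
}

/* Step 1: lower-or-equal lookup; step 2: range check against the end. */
static struct segment *segment_containing(uint64_t key)
{
        struct segment *seg = find_lower_equal(key);

        if (seg && key <= seg->end)
                return seg;
        return NULL;
}

int main(void)
{
        uint64_t keys[] = { 50, 150, 250, 1000 };
        size_t i;

        for (i = 0; i < sizeof(keys) / sizeof(keys[0]); i++) {
                struct segment *seg = segment_containing(keys[i]);

                if (seg)
                        printf("%llu -> [%llu, %llu]\n",
                               (unsigned long long)keys[i],
                               (unsigned long long)seg->start,
                               (unsigned long long)seg->end);
                else
                        printf("%llu -> no segment\n",
                               (unsigned long long)keys[i]);
        }
        return 0;
}

Note that, per the prototype quoted above, the real
cds_ja_lookup_lower_equal() returns a cds_hlist_head rather than a
single node, so step 2 would be done against the matching leaf node(s)
on that list; the sketch collapses that to a single node for brevity.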
Insert cost probably isn't an issue for XFS. Most workloads have a
buffer lookup ratio that is dominated by reads and cache hits.

FWIW, I've been thinking of ways of avoiding the need for overlaps in
the tree, to make it simpler to drop in replacements. The overlaps stem
from buffers that map freed extents that haven't yet been written to
the journal. They are free space, but we have to keep tracking them
until the transaction that freed the extent hits the disk. In the
meantime, it's free space, so it can be reallocated and a new buffer
can be built that overlaps part or all of the extent that has been
freed. If it's a perfect match we just reuse the existing buffer, but
if it isn't we need to have both the new buffer and the old buffer in
cache for some period of time...

There are a couple of things I think I can do to rework this so that we
don't need to track buffers over free space in the main cache, and that
would remove all the overlap issues from the cache. That would
certainly make things easier for dropping in some other type of index
that doesn't support overlaps easily...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx