RE: BlueStore Allocator.

> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Wednesday, March 16, 2016 3:03 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> Cc: Samuel Just <sjust@xxxxxxxxxx>; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: BlueStore Allocator.
> 
> On Fri, 11 Mar 2016, Allen Samuels wrote:
> > As promised, here is my description of data structures for an
> > allocator to be based on a bitmap-based freelist.
> >
> > I would propose that, for each bucket size of the current allocator
> > (4K, 8K, 16K, ...), there be an instance of a fast searching structure.
> > Each fast searching structure consists of a hierarchy of bitmaps. Each
> > level of the bitmap contains the reduction of some fixed number of
> > bits of the lower level of the bitmap, e.g., for each 64 bits at level
> > N there is one bit at level N+1. There would be
> > ceiling(log64(<freelist-size>)) levels (#levels). You might
> > consider this the equivalent of a fully-populated tree with a fan-out
> > of 64 on each level. (The 64 is arbitrary, but a good choice for a
> > 64-bit CPU. It could easily be parameterized for 32-bits on a 32-bit CPU).
> >
> > With this structure, you can locate a free item in O(#levels) calls to
> > ffo (find first one), plus the usual shifting/masking stuff. This is
> > true whether or not you start from a hint. This is exactly analogous
> > to the current interval_set (or btree_map) implementation, except that
> > all of the complex tree manipulation is reduced to some simple
> > shifting and vector indexing operations. Inserts and deletes are
> > relatively easy and require only O(#levels) compare and
> > bit-set/masking operations (which can be truncated in many cases).
> >
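
To make the shape concrete, here's a rough, untested C++ sketch of the
searching structure (one instance per bucket size). All names are mine and
nothing here is actual BlueStore code; __builtin_ctzll stands in for the
ffo primitive. Note that level[0] in the sketch is the freelist bitmap
itself, while "level 0" in the memory math further down means the first
summary level above it:

#include <cstdint>
#include <vector>

// One searching structure per bucket size.  level[0] mirrors the
// freelist (1 bit per block, 1 == free); each bit of level[N+1] is the
// OR of 64 bits of level[N], i.e. a set bit means "at least one free
// block somewhere below".
struct BitmapTree {
  std::vector<std::vector<uint64_t>> level;  // level[0] is the widest

  explicit BitmapTree(uint64_t nbits) {
    for (;;) {
      uint64_t words = (nbits + 63) / 64;
      level.emplace_back(words, 0);          // start fully allocated
      if (words == 1) break;
      nbits = words;                         // 1 summary bit per word below
    }
  }

  // Locate a free block in O(#levels) calls to ffo (__builtin_ctzll
  // here).  A hinted search is the same walk with the bits below the
  // hint masked off at each level.
  int64_t find_free() const {
    if (level.back()[0] == 0)
      return -1;                             // no free blocks at all
    uint64_t idx = 0;                        // word index at current level
    for (size_t l = level.size(); l-- > 0; )
      idx = idx * 64 + __builtin_ctzll(level[l][idx]);
    return (int64_t)idx;                     // block number
  }

  // Mark block b free: O(#levels) bit-sets, truncated as soon as the
  // summary bits above are already correct.
  void set_free(uint64_t b) {
    for (auto& lv : level) {
      uint64_t w = b / 64;
      uint64_t old = lv[w];
      lv[w] = old | (1ull << (b % 64));
      if (old != 0) break;                   // summaries above already set
      b = w;
    }
  }

  // Mark block b allocated: clear the bit, clearing the summary bits
  // above only while words drop to zero.
  void set_allocated(uint64_t b) {
    for (auto& lv : level) {
      uint64_t w = b / 64;
      lv[w] &= ~(1ull << (b % 64));
      if (lv[w] != 0) break;                 // still free space below here
      b = w;
    }
  }
};

An allocation is then find_free() + set_allocated(); loading the freelist
at startup is just a pass of set_free() calls.
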
> > IMO, this data structure is universally faster than the current
> > tree-based extent maps (either RB or btree_map), supports the same set
> > of operational primitives, all with worst-case complexity of O(#levels)
> > [effectively O(1) for our sizes], and is much more easily parallelized
> > than the current implementation.
> >
> > This data structure has constant memory consumption, which is probably
> > somewhat worse than the current implementation in typical scenarios
> > and dramatically better in worst-case scenarios.
> >
> > Memory consumption is as follows (again, please correct the math if I
> > make a mistake):
> >
> > For 4K buckets, we'll make one bit of level[0] cover 64 bits of the
> > freelist. That means that for each TB of storage we'll have 2^40/2^12
> > => 2^28 bits of freelist, which becomes 2^22 bits in level 0, which is
> > 2^19 bytes => 512KB. Level 1 will take 1/64th of that, which is 8KB --
> > small enough that we can ignore it.
> >
> > For 8K buckets, the first level is 1/2 of that => 256KB.
> > For 16K buckets, the first level is 1/2 of that => 128KB.
> > For 32K buckets, the first level is 1/2 of that => 64KB.
> > ... (as many bucket sizes as you want; memory consumption quickly
> > becomes irrelevant)
> >
> > Total consumption for each TB of storage is then 512K + 256K + 128K +
> > 64K + ..., which is about 1MB total (probably something like 1.1 or
> > 1.2MB when you include the smaller upper levels that I've ignored).
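
For what it's worth, here is a throwaway program that reproduces the
arithmetic (the 4K..1M bucket range is just an example of mine; adjust to
taste):

#include <cinttypes>
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t TB = UINT64_C(1) << 40;
  uint64_t total = 0;
  // one searching structure per bucket size, 4K .. 1M as an example
  for (uint64_t bucket = 4096; bucket <= (UINT64_C(1) << 20); bucket *= 2) {
    uint64_t freelist_bits = TB / bucket;       // one freelist bit per block
    uint64_t l0_bytes = freelist_bits / 64 / 8; // 1 level-0 bit per 64 bits
    total += l0_bytes;
    printf("%5" PRIu64 "K bucket: level 0 = %3" PRIu64 " KB\n",
           bucket >> 10, l0_bytes >> 10);
  }
  // the 4K bucket alone is 512KB; the series halves from there
  printf("level 0 total = %" PRIu64 " KB; upper levels add ~1/63 (~1.6%%)\n",
         total >> 10);
}
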
> >
> > Parallelizing this structure is trivial, you can just shard it in the
> > spatial domain.
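
e.g., building on the BitmapTree sketch above (the shard count and the
hint-to-shard mapping below are arbitrary choices of mine, not a proposal
for the real interface):

#include <memory>
#include <mutex>
#include <vector>

// Hypothetical spatial sharding: each shard owns a contiguous range of
// blocks, with its own BitmapTree and its own lock, so allocations in
// different regions of the device never contend.
struct ShardedAllocator {
  struct Shard {
    std::mutex lock;
    BitmapTree tree;                       // from the sketch above
    explicit Shard(uint64_t nbits) : tree(nbits) {}
  };
  std::vector<std::unique_ptr<Shard>> shards;  // unique_ptr: mutex can't move
  uint64_t blocks_per_shard;

  ShardedAllocator(uint64_t total_blocks, unsigned nshards)
      : blocks_per_shard(total_blocks / nshards) {
    for (unsigned i = 0; i < nshards; ++i)
      shards.emplace_back(new Shard(blocks_per_shard));
  }

  // Returns a global block number, or -1 if the hinted shard is full
  // (a real allocator would spill into neighbouring shards).
  int64_t allocate(uint64_t hint) {
    size_t si = (hint / blocks_per_shard) % shards.size();
    Shard& s = *shards[si];
    std::lock_guard<std::mutex> g(s.lock);
    int64_t b = s.tree.find_free();
    if (b < 0) return -1;
    s.tree.set_allocated((uint64_t)b);
    return (int64_t)(si * blocks_per_shard) + b;
  }
};
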
> 
> It would be nice if we made the internal nodes two bits instead of one, so
> that they could represent "00 = no allocations" and "11 = fully allocated"
> (and the sub-key doesn't even need to be instantiated) in addition to "01 =
> partially allocated" (and sub-key has the next level of detail down).
> 

Since you're only looking for free blocks, you don't care about the difference between "partially allocated" and "no allocations". All you care about is whether there's at least one free block below you.

> This will reduce memory usage in the general case (where hopefully lots of
> space is either fully allocated or fully free).

No, it actually doubles the memory consumption.

> 
> This might screw up the parallelism, though, with the kv transactions.
> Perhaps we do that only for the in-memory representation, and keep the kv
> representation simple (and pay for the sequential scan on startup).
> 

I don't think it affects the parallelism at all.


> sage


