On Tue, 23 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > Sent: Tuesday, August 23, 2016 12:03 PM
> > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> > Cc: ceph-devel@xxxxxxxxxxxxxxx
> > Subject: RE: bluestore blobs REVISITED
> >
> > I just got the onode, including the full lextent map, down to ~1500
> > bytes.  The lextent map encoding optimizations went something like this:
> >
> > - 1 blobid bit to indicate that this lextent starts where the last one
> >   ended.  (6500->5500)
> >
> > - 1 blobid bit to indicate the offset is 0; 1 blobid bit to indicate the
> >   length is the same as the previous lextent.  (5500->3500)
> >
> > - make blobid signed (1 bit) and delta-encode it relative to the previous
> >   blob.  (3500->1500).  In practice we'd get something between 1500 and
> >   3500 because blobids won't have as much temporal locality as my test
> >   workload.  OTOH, this is really important because blobids will also get
> >   big over time (we don't (yet?) have a way to reuse blobids and/or keep
> >   them unique to a hash key, so they grow as the osd ages).
>
> This seems fishy to me.  My mental model for the blob_id suggests that it
> must be at least 9 bits (for a random write workload) in size: 1K entries
> randomly written lead to an average distance of 512, which means 10 bits
> to encode -- plus the other optimization bits.  Meaning that you're going
> to have two bytes for each lextent, so at least 2048 bytes of lextents
> plus the remainder of the onode.  So, probably more like 2500 bytes.
>
> Am I missing something?

Yeah, it'll be ~2 bytes in general, not 1, so closer to 2500 bytes.  I
think this is still enough to get us under 4K of metadata if we make the
pg_stat_t encoding space-efficient (and possibly even without that).
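For concreteness, here is a rough sketch of the shape of that encoding --
not the actual code in the branch (that's in bluestore_types.cc, linked
below), and the helper names (FLAG_*, put_uvarint, zigzag) are made up for
illustration.  The low bits of the leading varint carry the three flag
bits and the rest carries the zigzag-encoded blobid delta:

// Rough sketch only; names are illustrative, not the actual BlueStore encoder.
#include <cstdint>
#include <string>

enum {
  FLAG_CONTIGUOUS  = 1,  // lextent starts where the previous one ended
  FLAG_ZERO_OFFSET = 2,  // offset into the blob is 0
  FLAG_SAME_LENGTH = 4,  // length equals the previous lextent's length
};

static void put_uvarint(std::string& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back((char)(v | 0x80));
    v >>= 7;
  }
  out.push_back((char)v);
}

static uint64_t zigzag(int64_t v) {   // signed -> unsigned for varint
  return ((uint64_t)v << 1) ^ (uint64_t)(v >> 63);
}

struct EncodeCtx {
  int64_t  prev_blobid = 0;
  uint64_t prev_end = 0, prev_len = 0;
};

void encode_lextent(EncodeCtx& c, std::string& out,
                    uint64_t logical_off, int64_t blobid,
                    uint64_t blob_off, uint64_t len) {
  uint64_t flags = 0;
  if (logical_off == c.prev_end) flags |= FLAG_CONTIGUOUS;
  if (blob_off == 0)             flags |= FLAG_ZERO_OFFSET;
  if (len == c.prev_len)         flags |= FLAG_SAME_LENGTH;

  // low 3 bits = flags, remaining bits = zigzag(blobid delta)
  put_uvarint(out, (zigzag(blobid - c.prev_blobid) << 3) | flags);
  if (!(flags & FLAG_CONTIGUOUS))  put_uvarint(out, logical_off);
  if (!(flags & FLAG_ZERO_OFFSET)) put_uvarint(out, blob_off);
  if (!(flags & FLAG_SAME_LENGTH)) put_uvarint(out, len);

  c.prev_blobid = blobid;
  c.prev_end    = logical_off + len;
  c.prev_len    = len;
}

With a blobid delta of around +/-512 the zigzagged value needs ~10 bits,
plus the 3 flag bits, so the leading varint is two bytes per lextent --
which is where the ~2048-byte / ~2500-byte estimate above comes from.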
> > https://github.com/liewegas/ceph/blob/wip-bluestore-blobwise/src/os/bluestore/bluestore_types.cc#L826
> >
> > This makes the metadata update for a 4k write look like
> >
> >   1500 byte onode update
> >   20 byte blob update
> >   182 byte + 847(*) byte omap updates (pg log)
> >
> >   * pg_info key... definitely some room for optimization here, I think!
> >
> > In any case, this is pretty encouraging.  I think we have a few options:
> >
> > 1) keep the extent map in the onode and (re)encode it fully each time
> >    (what I have now).  blobs live in their own keys.
> >
> > 2) keep the extent map in the onode but shard it in memory and only
> >    reencode the part(s) that get modified.  this will alleviate the cpu
> >    concerns around a more complex encoding.  no change to the on-disk
> >    format.
>
> This will save some CPU, but the write-amp is still pretty bad unless you
> get the entire commit under 4K bytes in RocksDB.
>
> > 3) join lextents and blobs (Allen's proposal) and dynamically bin based
> >    on the encoded size.
> >
> > 4) #3, but let shard_size=0 in the onode (or whatever) keep it inline
> >    with the onode, so that simple objects avoid any additional kv op.
>
> Yes, I'm still of the mind that this is the best approach.  I'm not sure
> it's really that hard, because most of the "special" cases can be dealt
> with in a common brute-force way (because they don't happen too often).

Yep.  I think my next task is to code this one up.  Before I start,
though, I want to make sure I'm doing the most reasonable thing.  Please
review:

Right now the blobwise branch has (unshared and) shared blobs in their own
keys.  Shared blobs only update when they are occluded and their ref_map
changes.  If it weren't for that, we could go back to cramming them
together in a single bnode key, but I think we'll do better with them as
separate blob keys.  This helps small reads (we don't load all blobs),
hurts large reads (we read them all from separate keys instead of all at
once), and of course helps writes because we only update a single small
blob record.

The onode will get an extent_map_shard_size, which is measured in bytes
and tells us how the map is chunked into keys.  If it's 0, the whole map
stays inline in the onode.  Otherwise, [0..extent_map_shard_size) is in
the first key, [extent_map_shard_size, extent_map_shard_size*2) is in the
second, and so on.  In memory, we can pull this out of the onode_t struct
and manage it in Onode, where we can keep track of which parts are loaded,
dirty, etc.

For each map chunk, we can do the inline blob trick.  Unfortunately we can
only do this for 'unshared' blobs, where 'shared' now means shared across
extent_map shards rather than across objects.  We can conservatively
estimate this by just looking at the blob size, at least, without looking
at the other lextents.

I think we can also do a bit better than Varada's current map<> used
during encode/decode.  As we iterate over the map, we can store the
ephemeral blob id *for this encoding* in the Blob (not blob_t), and use
the low bit of the blob id to indicate whether it is an ephemeral id or a
real one.  On decode we can size a vector at the start and fill it with
BlobRef entries as we go, to avoid a dynamic map<> or other structure.
Another nice thing here is that we only have to assign global blobids when
a blob is stored in a blob key instead of inline in the map, which keeps
the blob ids smaller (fewer significant bits).  The Bnode then goes from a
map of all blobs for the hash to a map of only the shared blobs (shared
between shards or objects) for the hash--i.e., those blobs that have a
blobid.

I think to make this work we then also need to change the in-memory
lextent map from map<uint64_t,bluestore_lextent_t> to
map<uint64_t,Lextent> so that we can handle the local vs remote pointer
cases and, for the local ones, store the BlobRef right there.  This'll be
marginally more complex to code (no 1:1 mapping of in-memory to encoded
structures), but it should be faster (one less in-memory map lookup).

Anyway, that's my current plan.  Any thoughts before I start?

sage
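P.S.  In case it helps review, here is a rough sketch of the ephemeral
blob id trick above.  The types are illustrative stand-ins, not the actual
Blob/Bnode classes: the low bit of a blob id says whether it is an
ephemeral (per-encoding, inline) id or a real global one, and decode can
index straight into a pre-sized vector for the inline case instead of
building a temporary map<>.

// Illustrative sketch only -- not the actual BlueStore types.
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct Blob;                                  // stand-in for BlueStore's Blob
using BlobRef = std::shared_ptr<Blob>;

struct Blob {
  uint64_t ephemeral_id = 0;  // valid only for the current encode pass
  uint64_t global_id = 0;     // assigned only if the blob gets its own key
};

// low bit set   -> ephemeral id, blob is encoded inline in this shard
// low bit clear -> global id, blob lives in a separate (possibly shared) key
inline uint64_t make_ephemeral_id(uint64_t n) { return (n << 1) | 1; }
inline uint64_t make_global_id(uint64_t n)    { return n << 1; }
inline bool     is_ephemeral(uint64_t id)     { return id & 1; }

struct ShardDecoder {
  std::vector<BlobRef> local_blobs;   // indexed by (ephemeral id >> 1)

  explicit ShardDecoder(size_t inline_blob_count)
    : local_blobs(inline_blob_count) {}

  // called for each inline blob as it is decoded, in order
  void add_local_blob(size_t n, BlobRef b) { local_blobs[n] = std::move(b); }

  BlobRef lookup(uint64_t id) const {
    if (is_ephemeral(id))
      return local_blobs[id >> 1];   // no map lookup needed
    return lookup_shared(id >> 1);   // consult the Bnode's shared-blob map
  }

  BlobRef lookup_shared(uint64_t /*global_id*/) const {
    // stubbed here; the real thing would look in the Bnode
    return nullptr;
  }
};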