RE: bluestore blobs

Sage Weil <sweil@xxxxxxxxxx> · Thu, 18 Aug 2016 15:10:22 +0000 (UTC)

On Thu, 18 Aug 2016, Allen Samuels wrote:
> > From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> > owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> > Sent: Wednesday, August 17, 2016 7:26 AM
> > To: ceph-devel@xxxxxxxxxxxxxxx
> > Subject: bluestore blobs
> > 
> > I think we need to look at other changes in addition to the encoding
> > performance improvements.  Even if they end up being good enough, these
> > changes are somewhat orthogonal and at least one of them should give us
> > something that is even faster.
> > 
> > 1. I mentioned this before, but we should keep the encoded
> > bluestore_blob_t around when we load the blob map.  If it's not changed,
> > don't reencode it.  There are no blockers for implementing this currently.
> > It may be difficult to ensure the blobs are properly marked dirty... I'll see if
> > we can use proper accessors for the blob to enforce this at compile time.  We
> > should do that anyway.
> 
> If it's not changed, then why are we re-writing it? I'm having a hard 
> time thinking of a case worth optimizing where I want to re-write the 
> oNode but the blob_map is unchanged. Am I missing something obvious?

An onode's blob_map might have 300 blobs, and a single write only updates 
one of them.  The other 299 blobs need not be reencoded, just memcpy'd.
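
A minimal sketch of that idea, in Python for brevity rather than the real
C++. The Blob class, its accessors, and the fixed-width struct layout are
all made up for illustration and are not the actual bluestore_blob_t
encoding. The only point is that each blob keeps its encoded bytes, every
mutator drops them, and encoding the map just concatenates cached bytes
for the clean blobs:

```python
import struct


class Blob:
    """Hypothetical stand-in for bluestore_blob_t: a list of (offset, length) extents."""

    encode_calls = 0  # counts real (non-cached) encodes, for illustration only

    def __init__(self, extents):
        self._extents = list(extents)
        self._encoded = None  # cached encoding; None means "dirty"

    @property
    def extents(self):
        # Read-only view: callers cannot mutate without going through a mutator.
        return tuple(self._extents)

    def add_extent(self, offset, length):
        self._extents.append((offset, length))
        self._encoded = None  # every mutator invalidates the cached bytes

    def encode(self):
        if self._encoded is None:
            Blob.encode_calls += 1
            parts = [struct.pack("<I", len(self._extents))]
            parts += [struct.pack("<QI", off, ln) for off, ln in self._extents]
            self._encoded = b"".join(parts)
        return self._encoded


def encode_blob_map(blob_map):
    """Concatenate per-blob encodings; clean blobs contribute their cached bytes."""
    return b"".join(
        struct.pack("<I", blob_id) + blob_map[blob_id].encode()
        for blob_id in sorted(blob_map)
    )


blob_map = {i: Blob([(i * 4096, 4096)]) for i in range(300)}
encode_blob_map(blob_map)              # initial encode touches all 300 blobs
Blob.encode_calls = 0
blob_map[7].add_extent(1 << 20, 4096)  # a single write dirties one blob
encoded = encode_blob_map(blob_map)
print(Blob.encode_calls)               # only the dirty blob was re-encoded
```

Routing every change through a mutator is what makes the dirty marking
hard to get wrong, which is the job the "proper accessors" would do at
compile time in the real code.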

> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
> > assemble the bufferlist (lots of bufferptrs to each untouched blob) into a
> > single rocksdb::Slice, and another memcpy somewhere inside rocksdb to
> > copy this into the write buffer.  We could extend the rocksdb interface to
> > take an iovec so that the first memcpy isn't needed (and rocksdb will instead
> > iterate over our buffers and copy them directly into its write buffer).  This is
> > probably a pretty small piece of the overall time... should verify with a
> > profiler before investing too much effort here.
> 
> I doubt it's the memcpy that's really the expensive part. I'll bet it's 
> that we're transcoding from an internal to an external representation on 
> an element by element basis. If the iovec scheme is going to help, it 
> presumes that the internal data structure essentially matches the 
> external data structure so that only an iovec copy is required. I'm 
> wondering how compatible this is with the current concepts of 
> lextent/blob/pextent.

I'm thinking of the xattr case (we have a bunch of strings to copy 
verbatim) and the updated-one-blob-and-kept-99-unchanged case: instead 
of memcpy'ing them into a big contiguous buffer and having rocksdb 
memcpy *that* into its larger buffer, give rocksdb an iovec so that the 
smaller buffers are assembled only once.

These buffers will be on the order of many 10s to a couple 100s of bytes.  
I'm not sure where the crossover point for constructing and then 
traversing an iovec vs just copying twice would be...
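
To make the two paths concrete, here is a toy model in Python. Both
function names are hypothetical, and the bytearray only stands in for
rocksdb's write buffer; rocksdb has no such iovec Put today as far as this
thread is concerned. It counts bytes copied and says nothing about where
the real crossover is:

```python
def put_contiguous(bufs, write_buffer):
    """Two copies: assemble one contiguous value, then copy it into the write buffer."""
    value = b"".join(bufs)      # copy 1: flatten the buffer list into one slice
    write_buffer.extend(value)  # copy 2: the store copies the slice internally
    return 2 * len(value)       # bytes copied in total


def put_gather(bufs, write_buffer):
    """One copy: walk the buffer list and append each piece directly."""
    copied = 0
    for buf in bufs:
        write_buffer.extend(buf)
        copied += len(buf)
    return copied


# 100 buffers of 100 bytes each, roughly the sizes discussed above.
bufs = [bytes([i]) * 100 for i in range(100)]
wb_contiguous, wb_gather = bytearray(), bytearray()
copied_contiguous = put_contiguous(bufs, wb_contiguous)
copied_gather = put_gather(bufs, wb_gather)
print(copied_contiguous, copied_gather)
```

The gather path halves the bytes copied but pays a per-buffer loop
iteration, which is exactly the tradeoff that needs a profiler.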

> > 3. Even if we do the above, we're still setting a big (~4k or more?) key into
> > rocksdb every time we touch an object, even when a tiny amount of
> > metadata is getting changed.  This is a consequence of embedding all of the
> > blobs into the onode (or bnode).  That seemed like a good idea early on
> > when they were tiny (i.e., just an extent), but now I'm not so sure.  I see a
> > couple of different options:
> > 
> > a) Store each blob as ($onode_key+$blobid).  When we load the onode, load
> > the blobs too.  They will hopefully be sequential in rocksdb (or definitely
> > sequential in zs).  Probably go back to using an iterator.
> > 
> > b) Go all in on the "bnode" like concept.  Assign blob ids so that they are
> > unique for any given hash value.  Then store the blobs as
> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then when
> > clone happens there is no onode->bnode migration magic happening--we've
> > already committed to storing blobs in separate keys.  When we load the
> > onode, keep the conditional bnode loading we already have.. but when the
> > bnode is loaded load up all the blobs for the hash key.  (Okay, we could fault
> > in blobs individually, but that code will be more complicated.)
> > 
> > In both these cases, a write will dirty the onode (which is back to being pretty
> > small.. just xattrs and the lextent map) and 1-3 blobs (also now small keys).
> > Updates will generate much lower metadata write traffic, which'll reduce
> > media wear and compaction overhead.  The cost is that operations (e.g.,
> > reads) that have to fault in an onode are now fetching several nearby keys
> > instead of a single key.
> > 
> > 
> > #1 and #2 are completely orthogonal to any encoding efficiency
> > improvements we make.  And #1 is simple... I plan to implement this shortly.
> > 
> > #3 is balancing (re)encoding efficiency against the cost of separate keys, and
> > that tradeoff will change as encoding efficiency changes, so it'll be difficult to
> > properly evaluate without knowing where we'll land with the (re)encode
> > times.  I think it's a design decision made early on that is worth revisiting,
> > though!
> 
> It's not just the encoding efficiency, it's the cost of KV accesses. For 
> example, we could move the lextent map into the KV world similarly to 
> the way that you're suggesting the blob_maps be moved. You could do it 
> for the xattrs also. Now you've almost completely eliminated any 
> serialization/deserialization costs for the LARGE oNodes that we have 
> today but have replaced that with several KV lookups (one small Onode, 
> probably an xAttr, an lextent and a blob_map).
> 
> I'm guessing that the "right" point is in between. I doubt that 
> separating the oNode from the xattrs pays off (especially since the 
> current code pretty much assumes that they are all cheap to get at).

Yep.. this is why it'll be a hard call to make, esp when the encoding 
efficiency is changing at the same time.  I'm calling out blobs here 
because they are biggish (lextents are tiny) and nontrivial to encode 
(xattrs are just strings).

> I'm wondering if it pays off to make each lextent entry a separate 
> key/value vs encoding the entire extent table (several KB) as a single 
> value. Same for the blobmap (though I suspect they have roughly the same 
> behavior w.r.t. this particular parameter)

I'm guessing no because they are so small that the kv overhead will dwarf 
the encoding cost, but who knows.  I think implementing the blob case 
won't be so bad and will give us a better idea (i.e., blobs are bigger and 
more expensive and if it's not a win there then certainly don't bother 
with lextents).
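
For the blob case, a rough sketch of option (a) against a toy ordered
store, again in Python. SortedKV, blob_key, the "/b/" separator, and all
of the sizes are invented for illustration; none of this is the real key
schema. It shows the tradeoff: an update rewrites one small blob key plus
the small onode, while a load has to scan every blob key under the onode:

```python
import bisect
import struct


class SortedKV:
    """Toy ordered key/value store standing in for rocksdb (hypothetical)."""

    def __init__(self):
        self._keys = []   # kept sorted so prefix scans are sequential
        self._vals = {}
        self.bytes_written = 0

    def put(self, key, value):
        if key not in self._vals:
            bisect.insort(self._keys, key)
        self._vals[key] = value
        self.bytes_written += len(key) + len(value)

    def scan_prefix(self, prefix):
        i = bisect.bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._vals[self._keys[i]]
            i += 1


def blob_key(onode_key, blob_id):
    # Fixed-width big-endian id so an onode's blobs sort together, in id order.
    return onode_key + b"/b/" + struct.pack(">I", blob_id)


kv = SortedKV()
onode_key = b"O.pool1.hash.obj"
onode_value = b"x" * 200             # made-up size: xattrs + lextent map
kv.put(onode_key, onode_value)
for blob_id in range(300):
    kv.put(blob_key(onode_key, blob_id), b"b" * 40)  # made-up blob size

kv.bytes_written = 0
kv.put(blob_key(onode_key, 7), b"B" * 40)  # a write dirties one blob...
kv.put(onode_key, onode_value)             # ...and the (small) onode
separate_bytes = kv.bytes_written

# What the same update costs if all 300 blobs are embedded in the onode value.
embedded_bytes = len(onode_key) + len(onode_value) + 300 * 40
blobs_loaded = len(list(kv.scan_prefix(onode_key + b"/b/")))
print(separate_bytes, embedded_bytes, blobs_loaded)
```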

> We need to temper this experiment with the notion that we change the 
> lextent/blob_map encoding to something that doesn't require transcoding 
> -- if possible.

Right.  I don't have any bright ideas here, though.  The variable length 
encoding makes this really hard and we still care about keeping things 
small.

sage