RE: bluestore blobs

> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Thursday, August 18, 2016 8:10 AM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: bluestore blobs
> 
> On Thu, 18 Aug 2016, Allen Samuels wrote:
> > > From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> > > owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> > > Sent: Wednesday, August 17, 2016 7:26 AM
> > > To: ceph-devel@xxxxxxxxxxxxxxx
> > > Subject: bluestore blobs
> > >
> > > I think we need to look at other changes in addition to the encoding
> > > performance improvements.  Even if they end up being good enough,
> > > these changes are somewhat orthogonal and at least one of them
> > > should give us something that is even faster.
> > >
> > > 1. I mentioned this before, but we should keep the encoding
> > > bluestore_blob_t around when we load the blob map.  If it's not
> > > changed, don't reencode it.  There are no blockers for implementing this
> currently.
> > > It may be difficult to ensure the blobs are properly marked dirty...
> > > I'll see if we can use proper accessors for the blob to enforce this
> > > at compile time.  We should do that anyway.
> >
> > If it's not changed, then why are we re-writing it? I'm having a hard
> > time thinking of a case worth optimizing where I want to re-write the
> > oNode but the blob_map is unchanged. Am I missing something obvious?
> 
> An onode's blob_map might have 300 blobs, and a single write only updates
> one of them.  The other 299 blobs need not be reencoded, just memcpy'd.

As long as we're just appending, that's a good optimization. How often does that happen? It's certainly not going to help the RBD 4K random write problem.
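
Rough sketch of what I'm picturing for #1 -- cache the encoded form per
blob and only re-encode the ones a mutator has dirtied. The types and
encode_blob() below are stand-ins, not the real Blob interface:

  // Sketch only: keep the load-time encoding and reuse it for clean blobs.
  #include <string>

  struct blob_t { /* extents, csum, flags, ... */ };

  // placeholder for the real bluestore_blob_t encoder
  std::string encode_blob(const blob_t&) { return std::string(32, 'x'); }

  struct CachedBlob {
    blob_t blob;
    std::string cached;       // encoding captured when the onode was loaded
    bool dirty = false;

    blob_t& get_mutable() { dirty = true; return blob; }  // only mutation path
    const blob_t& get() const { return blob; }

    void encode_to(std::string& out) {
      if (!dirty && !cached.empty()) {
        out += cached;                  // untouched blob: plain append/memcpy
      } else {
        cached = encode_blob(blob);     // touched blob: pay the encode cost
        dirty = false;
        out += cached;
      }
    }
  };

The compile-time enforcement Sage mentions would fall out of routing every
mutation through get_mutable().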

> 
> > > 2. This turns the blob Put into rocksdb into two memcpy stages: one
> > > to assemble the bufferlist (lots of bufferptrs to each untouched
> > > blob) into a single rocksdb::Slice, and another memcpy somewhere
> > > inside rocksdb to copy this into the write buffer.  We could extend
> > > the rocksdb interface to take an iovec so that the first memcpy
> > > isn't needed (and rocksdb will instead iterate over our buffers and
> > > copy them directly into its write buffer).  This is probably a
> > > pretty small piece of the overall time... should verify with a profiler
> before investing too much effort here.
> >
> > I doubt it's the memcpy that's really the expensive part. I'll bet
> > it's that we're transcoding from an internal to an external
> > representation on an element by element basis. If the iovec scheme is
> > going to help, it presumes that the internal data structure
> > essentially matches the external data structure so that only an iovec
> > copy is required. I'm wondering how compatible this is with the
> > current concepts of lextent/blob/pextent.
> 
> I'm thinking of the xattr case (we have a bunch of strings to copy
> verbatim) and updated-one-blob-and-kept-99-unchanged case: instead of
> memcpy'ing them into a big contiguous buffer and having rocksdb memcpy
> *that* into its larger buffer, give rocksdb an iovec so that the smaller
> buffers are assembled only once.
> 
> These buffers will be on the order of many 10s to a couple 100s of bytes.
> I'm not sure where the crossover point for constructing and then traversing
> an iovec vs just copying twice would be...
> 

Yes, this will eliminate the "extra" copy, but the real problem is that the oNode itself is just too large. I doubt removing one extra copy is going to suddenly "solve" this problem. I think we're going to end up rejiggering things so that this is much less of a problem than it is now -- time will tell.
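
For what it's worth, rocksdb already exposes a gather-style write on
WriteBatch via SliceParts, so the "iovec" plumbing may mostly be on our
side. Minimal sketch (the key/value layout here is made up):

  // Hand rocksdb a list of small buffers instead of flattening them first.
  #include <rocksdb/db.h>
  #include <rocksdb/write_batch.h>
  #include <string>
  #include <vector>

  void put_onode(rocksdb::DB* db,
                 const std::string& key,
                 const std::vector<std::string>& encoded_pieces) {
    std::vector<rocksdb::Slice> parts;
    parts.reserve(encoded_pieces.size());
    for (const auto& p : encoded_pieces)
      parts.emplace_back(p);                 // pointer+length only, no copy

    rocksdb::Slice kslice(key);
    rocksdb::SliceParts kparts(&kslice, 1);
    rocksdb::SliceParts vparts(parts.data(), (int)parts.size());

    rocksdb::WriteBatch batch;
    batch.Put(kparts, vparts);               // rocksdb assembles the value once
    db->Write(rocksdb::WriteOptions(), &batch);
  }

That still leaves the copy into rocksdb's write buffer (and the WAL), which
is my point above about the oNode size being the real issue.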

> > > 3. Even if we do the above, we're still setting a big (~4k or more?)
> > > key into rocksdb every time we touch an object, even when a tiny

See my analysis: you're looking at 8-10K for the RBD random write case, which I think everybody cares a lot about.

> > > amount of metadata is getting changed.  This is a consequence of
> > > embedding all of the blobs into the onode (or bnode).  That seemed
> > > like a good idea early on when they were tiny (i.e., just an
> > > extent), but now I'm not so sure.  I see a couple of different options:
> > >
> > > a) Store each blob as ($onode_key+$blobid).  When we load the onode,
> > > load the blobs too.  They will hopefully be sequential in rocksdb
> > > (or definitely sequential in zs).  Probably go back to using an iterator.
> > >
> > > b) Go all in on the "bnode" like concept.  Assign blob ids so that
> > > they are unique for any given hash value.  Then store the blobs as
> > > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then
> > > when clone happens there is no onode->bnode migration magic
> > > happening--we've already committed to storing blobs in separate
> > > keys.  When we load the onode, keep the conditional bnode loading we
> > > already have.. but when the bnode is loaded load up all the blobs
> > > for the hash key.  (Okay, we could fault in blobs individually, but
> > > that code will be more complicated.)

I like this direction. I think you'll still end up demand-loading the blobs in order to speed up the random-read case. This scheme will result in some space amplification, both in the lextent and in the blob-map; it's worth a bit of study to see how bad the metadata/data ratio becomes. Just as a guess, $shard.$poolid.$hash.$blobid is probably 16 + 16 + 8 + 16 bytes in size, i.e. ~60 bytes of key for each Blob -- unless your KV store does path compression. My reading of the RocksDB sst format seems to indicate that it doesn't; I *believe* that ZS does [need to confirm]. I'm wondering whether the current notion of local vs. global blobs isn't actually beneficial, in that we can give local blobs different names that sort with their associated oNode -- an important optimization, though it probably makes the space-amp worse. We do need to watch the space amp: we're going to be burning DRAM to make KV accesses cheap, and the amount of DRAM is proportional to the space amp.
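
To make the key-size concern concrete, here is roughly the key shape I'm
assuming (field widths are my guesses, not the actual key generator):

  // Back-of-envelope sketch of a per-blob key: shard.poolid.hash.blobid.
  #include <cstdint>
  #include <string>

  static void append_be(std::string& k, uint64_t v, int bytes) {
    for (int i = bytes - 1; i >= 0; --i)
      k.push_back(char((v >> (8 * i)) & 0xff));   // big-endian so keys sort
  }

  std::string blob_key(uint8_t shard, uint64_t poolid,
                       uint32_t hash, uint64_t blobid) {
    std::string k;
    k.push_back(char(shard));
    append_be(k, poolid, 8);
    append_be(k, hash, 4);
    append_be(k, blobid, 8);
    return k;     // ~21 bytes raw; an escaped/hex form of the same fields
  }               // is what pushes it toward the ~60 bytes estimated above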


> > >
> > > In both these cases, a write will dirty the onode (which is back to
> > > being pretty small.. just xattrs and the lextent map) and 1-3 blobs (also
> now small keys).

I'm not sure the oNode is going to be that small. Looking at the RBD random 4K write case, you're going to have 1K entries, each of which has an offset, size, and a blob-id reference in it. In my current oNode compression scheme this compresses to about 1 byte per entry. However, that optimization relies on being able to cheaply renumber the blob-ids, which is no longer possible when the blob-ids become parts of a key (see above). So now you'll have a minimum of 1.5-3 extra bytes for each blob-id (because you can't assume that the blob-ids are "dense" anymore), which puts you at 2.5-4 bytes per entry, or about 2.5-4K bytes of lextent table. Worse, because of the variable-length encoding you'll have to scan the entire table to deserialize it (yes, we could do differential editing when we write, but that's another discussion). Oh, and I forgot to add the 200-300 bytes of oNode and xattrs :). So while this looks small compared to the current ~30K for the entire oNode/lextent/blobmap, it's NOT a huge gain over the 8-10K of the compressed oNode/lextent/blobmap scheme that I published earlier.
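
For reference, this is the kind of per-entry encoding I'm assuming when I
say 2.5-4 bytes per lextent entry (illustrative varint/delta coding, not
the actual encoder):

  // Each entry: delta-coded offset, length, and a blob-id reference.
  #include <cstdint>
  #include <string>

  static void put_varint(std::string& out, uint64_t v) {
    while (v >= 0x80) { out.push_back(char((v & 0x7f) | 0x80)); v >>= 7; }
    out.push_back(char(v));
  }

  void encode_lextent(std::string& out,
                      uint64_t off_delta, uint64_t len, uint64_t blob_id) {
    put_varint(out, off_delta);   // ~1 byte for dense 4K-aligned writes
    put_varint(out, len);         // ~1 byte for common lengths
    put_varint(out, blob_id);     // 1-3 bytes once ids are sparse/global
  }
  // 1K entries at ~2.5-4 bytes each is the 2.5-4KB figure above, and the
  // variable-length coding forces a full scan to deserialize the table.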

If we want to do better we will need to separate the lextent from the oNode as well. It's relatively easy to move the lextents into the KV store itself. There are two obvious ways to deal with this: either use the native offset/size from the lextent itself as the key, or create 'N' buckets of logical offset into which we pour entries. Both of these would add somewhere between 1 and 2 KV look-ups per operation -- here is where an iterator would probably help.
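
A sketch of the second variant (bucketing by logical offset) to show why
it stays at 1-2 extra KV look-ups; the key format and bucket size are
assumptions for illustration:

  // Spread lextent entries across a few keys that sort right after the onode.
  #include <cstdint>
  #include <string>

  constexpr uint64_t kBucketBytes = 1ull << 22;    // e.g. 4MB of logical space

  std::string lextent_bucket_key(const std::string& onode_key,
                                 uint64_t logical_offset) {
    uint64_t bucket = logical_offset / kBucketBytes;
    std::string k = onode_key;
    k.push_back('.');
    for (int i = 7; i >= 0; --i)                   // big-endian so buckets sort
      k.push_back(char((bucket >> (8 * i)) & 0xff));
    return k;
  }
  // A 4K overwrite reads/writes one bucket key (two if it straddles a
  // boundary), so roughly 1-2 extra KV operations per write.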

Unfortunately, if you only process a portion of the lextent (because you've made it into multiple keys and you don't want to load all of them), you can no longer regenerate the refmap on the fly (another key space optimization). The lack of a refmap screws up a number of other important algorithms -- for example the overlapping blob-map thing, etc. Not sure if these are easy to rewrite or not -- too complicated to think about at this hour of the evening.
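
To spell out the refmap dependency (shapes simplified, not the real
bluestore types): today the per-blob reference map can be rebuilt by
walking *all* the lextents that point at a blob, e.g.:

  #include <cstdint>
  #include <map>

  struct lextent_t { int64_t blob_id; uint64_t blob_off; uint64_t len; };
  struct blob_t    { std::map<uint64_t, uint32_t> ref_map; };  // off -> refs

  void rebuild_refmaps(const std::map<uint64_t, lextent_t>& lextents,
                       std::map<int64_t, blob_t>& blobs) {
    for (auto& b : blobs)
      b.second.ref_map.clear();
    for (const auto& p : lextents)                 // needs the WHOLE map
      blobs[p.second.blob_id].ref_map[p.second.blob_off] += 1;
  }

If only one bucket's worth of lextents is in memory, those counts are
incomplete, and anything that relies on them (e.g. the overlapping
blob-map logic) has to change.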
 
 
> > > Updates will generate much lower metadata write traffic, which'll
> > > reduce media wear and compaction overhead.  The cost is that
> > > operations (e.g.,
> > > reads) that have to fault in an onode are now fetching several
> > > nearby keys instead of a single key.
> > >
> > >
> > > #1 and #2 are completely orthogonal to any encoding efficiency
> > > improvements we make.  And #1 is simple... I plan to implement this
> shortly.
> > >
> > > #3 is balancing (re)encoding efficiency against the cost of separate
> > > keys, and that tradeoff will change as encoding efficiency changes,
> > > so it'll be difficult to properly evaluate without knowing where
> > > we'll land with the (re)encode times.  I think it's a design
> > > decision made early on that is worth revisiting, though!
> >
> > It's not just the encoding efficiency, it's the cost of KV accesses.
> > For example, we could move the lextent map into the KV world similarly
> > to the way that you're suggesting the blob_maps be moved. You could do
> > it for the xattrs also. Now you've almost completely eliminated any
> > serialization/deserialization costs for the LARGE oNodes that we have
> > today but have replaced that with several KV lookups (one small Onode,
> > probably an xAttr, an lextent and a blob_map).
> >
> > I'm guessing that the "right" point is in between. I doubt that
> > separating the oNode from the xattrs pays off (especially since the
> > current code pretty much assumes that they are all cheap to get at).
> 
> Yep.. this is why it'll be a hard call to make, esp when the encoding efficiency
> is changing at the same time.  I'm calling out blobs here because they are
> biggish (lextents are tiny) and nontrivial to encode (xattrs are just strings).
> 
> > I'm wondering if it pays off to make each lextent entry a separate
> > key/value vs encoding the entire extent table (several KB) as a single
> > value. Same for the blobmap (though I suspect they have roughly the
> > same behavior w.r.t. this particular parameter)
> 
> I'm guessing no because they are so small that the kv overhead will dwarf the
> encoding cost, but who knows.  I think implementing the blob case won't be
> so bad and will give us a better idea (i.e., blobs are bigger and more
> expensive and if it's not a win there then certainly don't bother with
> lextents).
> 
> > We need to temper this experiment with the notion that we change the
> > lextent/blob_map encoding to something that doesn't require
> > transcoding
> > -- if possible.
> 
> Right.  I don't have any bright ideas here, though.  The variable length
> encoding makes this really hard and we still care about keeping things small.

Without some clear measurements on the KV-get cost vs. object size (copy in/out plus serialize/deserialize) it's going to be difficult to figure out what to do.
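
The measurement I have in mind is nothing fancy -- time Get() against a few
value sizes so we can see where the copy/deserialize cost starts to
dominate. Rough single-threaded sketch (no warm-up, no percentiles):

  #include <rocksdb/db.h>
  #include <chrono>
  #include <cstdio>
  #include <string>
  #include <vector>

  void bench_gets(rocksdb::DB* db, const std::vector<std::string>& keys) {
    std::string value;
    auto t0 = std::chrono::steady_clock::now();
    for (const auto& k : keys)
      db->Get(rocksdb::ReadOptions(), k, &value);
    auto t1 = std::chrono::steady_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    std::printf("%zu gets, %.2f us/get, last value %zu bytes\n",
                keys.size(), us / keys.size(), value.size());
  }

Run that against keyspaces populated with, say, 1K / 4K / 10K / 30K values
and the crossover we keep speculating about should fall out.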

> 
> sage