On Wed, 17 Aug 2016, Haomai Wang wrote:
> another latency perf problem:
>
> rocksdb log is on bluefs and mainly uses the append and fsync interfaces
> to complete the WAL.
>
> I found that the latency between kv transaction submissions isn't
> negligible and limits the transaction throughput.
>
> So what if we implement an async transaction submit on the rocksdb side,
> using a callback? It would decrease kv in-queue latency and bring rocksdb
> WAL performance close to FileJournal. An async interface would also help
> control each kv transaction's size and let transactions complete
> smoothly, with microsecond precision, instead of in TPS spikes.

Can we get the same benefit by calling BlueFS::_flush on the log whenever
we have X bytes accumulated (I think there is an option in rocksdb that
drives this already, actually)?  Changing the interfaces around will
change the threading model (= work) but doesn't actually change who needs
to wait and when.

sage

>
>
> On Wed, Aug 17, 2016 at 10:26 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> > I think we need to look at other changes in addition to the encoding
> > performance improvements.  Even if they end up being good enough, these
> > changes are somewhat orthogonal, and at least one of them should give
> > us something that is even faster.
> >
> > 1. I mentioned this before, but we should keep the encoded
> > bluestore_blob_t around when we load the blob map.  If it's not
> > changed, don't re-encode it.  There are no blockers for implementing
> > this currently.  It may be difficult to ensure the blobs are properly
> > marked dirty... I'll see if we can use proper accessors for the blob
> > to enforce this at compile time.  We should do that anyway.
> >
> > 2. This turns the blob Put into rocksdb into two memcpy stages: one to
> > assemble the bufferlist (lots of bufferptrs to each untouched blob)
> > into a single rocksdb::Slice, and another memcpy somewhere inside
> > rocksdb to copy this into the write buffer.  We could extend the
> > rocksdb interface to take an iovec so that the first memcpy isn't
> > needed (and rocksdb would instead iterate over our buffers and copy
> > them directly into its write buffer).  This is probably a pretty small
> > piece of the overall time... we should verify with a profiler before
> > investing too much effort here.
> >
> > 3. Even if we do the above, we're still writing a big (~4 KB or more?)
> > value into rocksdb every time we touch an object, even when only a
> > tiny amount of metadata is getting changed.  This is a consequence of
> > embedding all of the blobs in the onode (or bnode).  That seemed like
> > a good idea early on when they were tiny (i.e., just an extent), but
> > now I'm not so sure.  I see a couple of different options:
> >
> > a) Store each blob as ($onode_key+$blobid).  When we load the onode,
> > load the blobs too.  They will hopefully be sequential in rocksdb (or
> > definitely sequential in zs).  Probably go back to using an iterator.
> >
> > b) Go all in on the "bnode"-like concept.  Assign blob ids so that
> > they are unique for any given hash value.  Then store the blobs as
> > $shard.$poolid.$hash.$blobid (i.e., where the bnode is now).  Then
> > when clone happens there is no onode->bnode migration magic
> > happening--we've already committed to storing blobs in separate keys.
> > When we load the onode, keep the conditional bnode loading we already
> > have... but when the bnode is loaded, load up all the blobs for the
> > hash key.  (Okay, we could fault in blobs individually, but that code
> > will be more complicated.)
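
A rough sketch of what a per-blob key under option (b) could look like.
The fixed field widths, the big-endian packing, and make_blob_key() itself
are illustrative assumptions here, not the actual BlueStore key encoding:

#include <cstdint>
#include <string>

static void append_be32(std::string& key, uint32_t v) {
  // most-significant byte first so lexicographic order matches numeric order
  for (int shift = 24; shift >= 0; shift -= 8)
    key.push_back(char((v >> shift) & 0xff));
}

static void append_be64(std::string& key, uint64_t v) {
  for (int shift = 56; shift >= 0; shift -= 8)
    key.push_back(char((v >> shift) & 0xff));
}

// $shard.$poolid.$hash.$blobid, one key per blob
std::string make_blob_key(uint8_t shard, uint64_t poolid,
                          uint32_t hash, uint64_t blobid) {
  std::string key;
  key.push_back(char(shard));
  append_be64(key, poolid);
  append_be32(key, hash);
  append_be64(key, blobid);   // unique within this hash value, per (b)
  return key;
}

With the fields packed most-significant-byte first, all blobs sharing a
$shard.$poolid.$hash prefix sort adjacently in rocksdb, so loading a
bnode's blobs is one short iterator scan over a contiguous key range.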
> >
> > In both these cases, a write will dirty the onode (which is back to
> > being pretty small... just xattrs and the lextent map) and 1-3 blobs
> > (also now small keys).  Updates will generate much lower metadata
> > write traffic, which'll reduce media wear and compaction overhead.
> > The cost is that operations (e.g., reads) that have to fault in an
> > onode are now fetching several nearby keys instead of a single key.
> >
> > #1 and #2 are completely orthogonal to any encoding efficiency
> > improvements we make.  And #1 is simple... I plan to implement this
> > shortly.
> >
> > #3 is balancing (re)encoding efficiency against the cost of separate
> > keys, and that tradeoff will change as encoding efficiency changes, so
> > it'll be difficult to properly evaluate without knowing where we'll
> > land with the (re)encode times.  I think it's a design decision made
> > early on that is worth revisiting, though!
> >
> > sage
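
One more note on #2 above: if I remember right, rocksdb already has a
writev-style overload of WriteBatch::Put() that takes SliceParts, which is
close to the iovec idea.  A minimal sketch, assuming we exposed it through
our KeyValueDB layer; put_gathered() and the (ptr, len) segment list stand
in for the bufferlist plumbing and are not existing code:

#include <rocksdb/slice.h>
#include <rocksdb/write_batch.h>
#include <string>
#include <utility>
#include <vector>

// Hand rocksdb one Slice per already-encoded buffer segment so the
// bufferlist never has to be flattened into a single contiguous Slice.
void put_gathered(rocksdb::WriteBatch& batch,
                  const std::string& key,
                  const std::vector<std::pair<const char*, size_t>>& segs)
{
  std::vector<rocksdb::Slice> parts;
  parts.reserve(segs.size());
  for (const auto& s : segs)
    parts.emplace_back(s.first, s.second);

  rocksdb::Slice key_slice(key);
  rocksdb::SliceParts key_parts(&key_slice, 1);
  rocksdb::SliceParts val_parts(parts.data(), int(parts.size()));

  // rocksdb concatenates the parts directly into its write buffer,
  // leaving only the second memcpy from #2.
  batch.Put(key_parts, val_parts);
}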