RE: A way to reduce BlueStore KV traffic

Yes, the KV interface provides a lot of really good semantics and good structure for thinking about the problem. However, it's a truism that a special-purpose thing can beat a general-purpose thing. The question is by how much and at what cost, AND whether the usage justifies the investment required.

The KV interface itself has a number of sub-optimal aspects when used with Ceph (and you NEED to talk about it in the context of usage).
AND the RocksDB implementation is also a poor match for Ceph usage patterns.

W.r.t. the KV interface itself: it's nice and clean to treat all keys the same way, but in reality we know that many of the different keys exhibit wildly different characteristics. The inability to connect specific behaviors to specific keys is a lost opportunity for optimization.
You can broadly separate the optimizations into two categories: behavioral and representational. 

Examples of behavioral optimizations are the allocation bitmap and the pg-log/pg-info data structures. There are very specific optimizations that can be performed for these keys that can have significant performance implications. The push_back/pop_front behavior of the pg-log is ripe for optimization, as is the desire to lock all of the allocation bitmap entries into memory.
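
To make that concrete, here's a rough sketch (illustrative only, invented names, not Ceph code) of the access pattern a pg-log-aware structure could exploit: entries are only appended at the tail and trimmed from the head, so a purpose-built store never needs a general-purpose merge/compaction pass for these keys.

  // Hypothetical sketch of a FIFO-aware pg-log store; names and layout are
  // invented for illustration.
  #include <cstdint>
  #include <deque>
  #include <string>
  #include <utility>

  struct PGLogFifo {
    uint64_t head = 0;                // first live log sequence number
    std::deque<std::string> entries;  // encoded entries, appended at the tail

    // push_back: the only write path for new entries.
    uint64_t push_back(std::string encoded) {
      entries.push_back(std::move(encoded));
      return head + entries.size() - 1;   // sequence number of the new entry
    }
    // pop_front: trim everything older than new_head in one pass.
    void pop_front(uint64_t new_head) {
      while (head < new_head && !entries.empty()) {
        entries.pop_front();
        ++head;
      }
    }
  };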

Representational optimizations can be significant too. The KV interface doesn't support simplifying structures like an array (which would be ideal for the allocation bitmap and probably other things too). Instead you have to redundantly store many individual pairs whose keys are totally predictable. This inhibits the ability to have fine-grained objects.
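
As a hypothetical illustration of the representational point (my own names, not the real BlueStore schema): compare an allocation bitmap stored as one flat array against one kv pair per chunk, where the key text is entirely predictable and therefore pure overhead.

  // Invented names for illustration only.
  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  // KV-style: one pair per 128-block chunk; the key "alloc_<chunk#>" carries
  // no information that the chunk index does not already carry.
  std::map<std::string, std::vector<uint8_t>> kv_bitmap;   // "alloc_000123" -> 16 bytes

  // Array-style: the index *is* the key; nothing predictable is repeated,
  // and the granularity can be as fine as one bit per block.
  std::vector<uint64_t> array_bitmap;                       // bit i == block i allocated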

Another problem is the value-based semantics of the interface (copy-in/copy-out). They're nice and clean and easy to implement, but they are expensive for large data structures that are unchanged or only slightly changed (think oNode). We've just spent 5 months (July to December) recovering from the unexpected need to shard the oNode/extent maps. This effort largely stems from the value-based semantics of the KV interface. There's simply no efficient substitute for reference-based semantics when dealing with large objects -- this does create a HUGE increment in complexity (locking issues, mainly) -- but you get back a lot of CPU time as a result. One can argue that we just did a part of the space management work that Sage claims we got to avoid by using a KV store (instead of directly managing oNodes in blocks). Note the compounding of inefficiencies due to the lack of arrays in the encoding of shards within the oNodes.
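
A toy contrast of the two access models (this is NOT the Ceph KeyValueDB API, just a sketch with invented names): with copy-in/copy-out, a one-field change decodes and re-encodes the whole value; with reference semantics, only the dirty field is touched, at the cost of the locking complexity mentioned above.

  // Toy stores with invented names; real onodes are far larger and encoded
  // differently.  The point is only the shape of the two access paths.
  #include <cstdint>
  #include <map>
  #include <string>

  struct Onode { uint64_t size = 0; std::string extent_map; /* ...large... */ };

  std::string encode(const Onode& o) { return std::to_string(o.size) + "|" + o.extent_map; }
  Onode decode(const std::string& v) {
    Onode o;
    auto p = v.find('|');
    o.size = std::stoull(v.substr(0, p));
    o.extent_map = v.substr(p + 1);
    return o;
  }

  std::map<std::string, std::string> kv;   // value semantics: serialized blobs
  std::map<std::string, Onode> pinned;     // reference semantics: live objects

  // Copy-in/copy-out: even a one-field change pays for the whole object.
  void kv_update_size(const std::string& key, uint64_t new_size) {
    Onode o = decode(kv[key]);   // copy-out: decode everything
    o.size = new_size;           // touch one field
    kv[key] = encode(o);         // copy-in: re-encode everything
  }

  // Reference-based: mutate in place; only the dirty field needs flushing.
  void ref_update_size(const std::string& key, uint64_t new_size) {
    pinned[key].size = new_size;
  }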

Another problem with the value-based semantics is the inability to leverage previous lookups. For example, with a sharded oNode we'll do two KV lookups: one for the oNode and (at least) one for the extent shard. Clearly the extent shard would be most profitably located by forward scanning from the location of the oNode (we've cleverly arranged for them to be "close" in key space), but the interface semantics prevent that.
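
Here's a rough sketch of the "one seek, then scan forward" pattern that the interface currently rules out (key names invented for illustration): because the oNode and its extent shards sort adjacently, a single ordered iterator could serve both lookups.

  // Illustrative only: std::map stands in for the ordered kv store.
  #include <map>
  #include <string>
  #include <vector>

  std::map<std::string, std::string> meta;   // onode key, then "<onode key>.shard_0000", ...

  // One seek, then forward iteration while the prefix still matches,
  // instead of independent point lookups for the onode and each shard.
  std::vector<std::string> load_onode_and_shards(const std::string& onode_key) {
    std::vector<std::string> out;
    auto it = meta.lower_bound(onode_key);                     // single seek
    for (; it != meta.end() &&
           it->first.compare(0, onode_key.size(), onode_key) == 0; ++it) {
      out.push_back(it->second);                               // onode first, shards after
    }
    return out;
  }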

The implementation of RocksDB itself is a poor match for Ceph. The LSM tree is adept at batching together multiple mutations into a single update of a portion of the metadata store. Inherently this works well when the mutations are clustered in key space (pg-log/pg-info, allocation, WAL). However, it is a disastrous mismatch for Ceph oNodes (which dominate the metadata). That's because we hash the object name and use the hash value as the major part of the oNode key, which serves to spread out (un-cluster) mutations. In other words, the hashing behavior guarantees that regardless of application niceness (e.g., sequential access) the resulting mutations are spread evenly through the metadata space -- a worst-case merge for RocksDB.
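
To illustrate why that defeats clustering (key layout simplified and invented, not the actual BlueStore key format): the hash is the leading component of the oNode key, so objects whose names are adjacent end up scattered across the key space, and every batch of updates touches many parts of the LSM tree.

  // Simplified, invented key layout for illustration.
  #include <cstdint>
  #include <cstdio>
  #include <functional>
  #include <string>

  std::string onode_key(int64_t pool, const std::string& name) {
    // std::hash stands in for the rados name hash; the point is only that the
    // hash sorts ahead of the name.
    uint32_t hash = static_cast<uint32_t>(std::hash<std::string>{}(name));
    char buf[96];
    std::snprintf(buf, sizeof(buf), "O:%lld:%08x:%s",
                  static_cast<long long>(pool), hash, name.c_str());
    return buf;   // "obj-0001", "obj-0002", ... land in unrelated key ranges
  }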

ZetaScale is a B-tree implementation of a KV store. It handles the randomness of the oNode mutations well, but performs poorly, compared to Rocks, on the clustered mutations like pg-log/pg-info, allocation, and WAL.  For the last year (or so) the ZS team has been struggling to create special-purpose implementations of exactly those portions of the key space (actually, we have a "shim" that recognizes those keys and does special handling of them).
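
To sketch the idea (made-up prefixes and logic, not the actual ZS shim code): the clustered key families are recognized up front and routed to purpose-built handlers, while everything else falls through to the generic B-tree path.

  // Made-up prefixes; the real BlueStore prefixes and shim logic differ.
  #include <string>

  enum class KeyClass { PGLOG, ALLOC, WAL, GENERIC };

  KeyClass classify(const std::string& key) {
    if (key.rfind("pglog.", 0) == 0) return KeyClass::PGLOG;   // clustered, FIFO-ish
    if (key.rfind("alloc.", 0) == 0) return KeyClass::ALLOC;   // clustered, array-like
    if (key.rfind("wal.",   0) == 0) return KeyClass::WAL;     // clustered, short-lived
    return KeyClass::GENERIC;                                  // generic B-tree path
  }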


> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> Sent: Tuesday, December 13, 2016 10:32 AM
> To: Somnath Roy <Somnath.Roy@xxxxxxxxxxx>
> Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Igor Fedotov
> <ifedotov@xxxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: RE: A way to reduce BlueStore KV traffic
> 
> On Tue, 13 Dec 2016, Somnath Roy wrote:
> > Igor,
> > This is nice, but I am not yet ready to give up on DBs :-) This also adds
> extra complexity to deployment. Not sure about the extra read/write impact
> on slower devices either; your benchmark should reveal that.
> > There is a lot of effort going on to reduce db traffic, optimize kvs, etc., and
> hopefully that will resolve the matter. Otherwise, we will probably need to go
> with this route.
> 
> I tend to agree.  I don't think the kv interface is to blame here, but rather the
> implementation behind it.  It gives you efficient lookups and space utilization,
> locality, ordered object enumeration, and atomic transactions.  The current
> set of tradeoffs in rocksdb is proving to be less than ideal, but I think that's
> something that can be fixed either within rocksdb or by swapping in another
> kv db.
> 
> Managing onodes on disk manually means we have to implement the
> ordered lookup, or keep using kv just for the name -> onode block
> translation.
> And then deal with space reclamation, etc.  I don't think it's just a complexity
> trade-off, though, where doing it ourselves means doing it better... the same
> can be done by the kv store itself.  For example, on flash, if we don't need
> locality, we could keep the name -> onode (nid?) mapping inline and put the
> metadata itself elsewhere.  Some kv db's do that, including LSM-based ones:
> they put all the keys in one set of sst files and the data/values in another set,
> and show pretty impressive gains (esp. when, say, putting the key data on ssd
> and value data on hdd).
> 
> I would be inclined to look at rocksdb alternatives like ZS or maybe tokudb
> (https://github.com/percona/PerconaFT) before scrapping the kv store
> entirely...
> 
> I also think there's opportunity to improve the pg log kv behavior with a time-
> series friendly column family or similar.  The current rocksdb compaction
> strategies aren't quite right, but I don't think they're too far off from
> something that will avoid any compaction of pg log data in the general case.
> 
> I'll take a look and mull this over some more...
> 
> > On 12/13/2016 10:21 AM, Igor Fedotov wrote:
> > > Hey cephers!
> > >
> > > We have been fighting for KV traffic reduction in BlueStore for a while.
> > > We're pushing a huge amount of data to KV (object metadata, write-ahead
> > > log, block device allocation map, etc.) and this impacts performance
> > > dramatically. Below I'm trying to fix that by storing most of the object
> > > metadata to a block device directly. Actually, we should use a second
> > > (fast) block device that can be physically co-located with the DB or WAL
> > > devices.
> >
> > Indeed!  KV traffic (specifically Rocksdb compaction) is probably my #1
> concern right now for Bluestore.  Based on what we saw when separating
> data out into different column families, about 2/3 of the compaction traffic
> was for object metadata and the other 1/3 for pglog.  WAL writes, at least in
> my setup, were leaking very minimally into the higher levels and were less of a
> concern.
> >
> > > Let's start with onode metadata only, for the sake of simplicity.
> > > We have somewhere between 4K and 64K (and even more) of metadata per
> > > single onode for a 4M object.  That includes user attrs, csum info,
> > > logical to physical extent mappings, etc. This information is updated
> > > (partially or totally) on each write. The idea is to save that info to a
> > > fast block device by direct use of an additional BlockDevice instance.
> > > E.g. one can allocate an additional partition sharing the same physical
> > > device with the DB for that.
> >
> > The big problem we have right now is the combination of all of that per-IO
> metadata compounded with the write-amp / compaction overhead in
> RocksDB.  I think it's going to take a concerted effort to further audit what we
> are writing out (like determining if we really need things like the per-blob
> header info as Varada mentioned) along with getting that data out of
> rocksdb.  Whether that means simply switching to something like zetascale
> or the idea you are proposing here, I'm not sure.
> > Hopefully I will be testing Somnath's zetascale branch soon.
> >
> > > Instead of the full onode representation, the KV DB will contain the
> > > allocated physical extent layout for this metadata on a per-onode basis,
> > > similar to the blob pextents vector - i.e. some indexing info.  Plus some
> > > minor data too, if needed. Additionally, the KV holds the free space
> > > tracking info from a second FreeList manager for the fast block device.
> > > When saving an onode, bluestore has to allocate the required space on the
> > > fast device, mark old extents for release, write both the onode and the
> > > user data to the block devices (in parallel), and update the db with the
> > > space allocations. I.e. the metadata overwrite procedure starts to
> > > resemble a user data overwrite.
> > >
> > > A similar idea can be applied for the WAL - one can store user data to
> > > the fast device directly and update only the indexing information in KV.
> >
> > I don't think the WAL is actually hurting us much right now fwiw.
> >
> > >
> > > Indexing information is pretty short, and perhaps one should read it
> > > into memory on store mount and not retrieve it from the DB during
> operation.
> 
> Any strategy that assumes we put everything in RAM is going to be
> problematic for some workloads.  We can't assume objects are 4MB.  RGW
> users, for example, are free to upload nothing but 4K objects if they like.
> 
> > > This way DB traffic is reduced considerably, and hence compaction
> > > will happen less frequently. Moreover, we can probably remove the varint
> > > encoding stuff, since we can be less careful about serialized onode size
> from now on.
> >
> > I think your proposal is heading in the right direction if we intend to stick
> with rocksdb.  If we decide to move toward zetascale, we'll need to think
> about some kind of WAL solution, but I'm less sure if we need to move data
> out of the KV store.  Maybe Somnath or Allen will comment.
> > I agree regarding varint encoding regardless of what we do; I'd much rather
> at least batch into prefixVarint or even just focus on compression of the 0/1
> case.
> >
> > >
> > > There is some POC code located at
> > > https://github.com/ifed01/ceph/tree/wip_bluestore_minimize_db
> > > The POC code lacks WAL support, index retrieval on startup, and varint
> > > encoding elimination at the moment.
> > > Performance testing is still in progress.
> > >
> > > Any thoughts/comments?
> >
> > I'd be happy to take it for a test drive once I've got some of the zetascale
> testing done.  There's plenty of merit in trying it, at least.
> >
> > >
> > > Thanks,
> > > Igor