Re: mon switch from leveldb to rocksdb

On 05/03/2016 11:41 AM, Gregory Farnum wrote:
On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
On 05/02/2016 02:00 PM, Howard Chu wrote:

Sage Weil wrote:

1) Thoughts on moving to rocksdb in general?


Are you actually prepared to undertake all of the measurement and tuning
required to make RocksDB actually work well? You're switching from an
(abandoned/unsupported) engine with only a handful of config parameters
to one with ~40-50 params, all of which have critical but unpredictable
impact on resource consumption and performance.
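
(For a sense of the tuning surface Howard is describing, here's a minimal
sketch against the stock RocksDB C++ API -- not Ceph's actual wrapper code,
and the values are placeholders rather than recommendations.  It touches only
a handful of the knobs, but even these few govern memtable memory, LSM shape,
and write back-pressure:)

  #include <rocksdb/db.h>
  #include <rocksdb/options.h>
  #include <cassert>

  int main() {
    rocksdb::Options opts;
    opts.create_if_missing = true;

    // Memtable sizing: trades memory for write throughput.
    opts.write_buffer_size = 32 << 20;        // bytes per memtable
    opts.max_write_buffer_number = 4;         // memtables before writes stall

    // LSM shape: when compaction kicks in and how large each level grows.
    opts.level0_file_num_compaction_trigger = 4;
    opts.target_file_size_base = 64 << 20;
    opts.max_bytes_for_level_base = 256 << 20;

    // Back-pressure: as L0 files pile up, writes are first slowed, then stopped.
    opts.level0_slowdown_writes_trigger = 20;
    opts.level0_stop_writes_trigger = 36;

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/mon-rocksdb-test", &db);
    assert(s.ok());
    delete db;
    return 0;
  }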


You are absolutely correct, and there are definitely pitfalls we need to
watch out for with the number of tunables in rocksdb.  At least on the
performance side, two of the big issues we've hit with leveldb are compaction
related.  In some scenarios compaction can't keep up with the incoming
writes, resulting in ever-growing db sizes.  The other issue is
that compaction is single threaded, which can cause stalls and general
mayhem when things get really heavily loaded.  My hope is that if we do go
with rocksdb, even in a sub-optimally tuned state, we'll be better off than
we were with leveldb.
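
(To illustrate the threading point: unlike leveldb, rocksdb can hand
compaction and flush work to a pool of background threads.  A minimal sketch
using the stock RocksDB C++ API follows; the thread counts are arbitrary
placeholders, not tested recommendations for the mon:)

  #include <rocksdb/db.h>
  #include <rocksdb/env.h>
  #include <rocksdb/options.h>

  // Give RocksDB several background threads so compaction can keep pace with
  // incoming writes instead of serializing behind a single thread.
  rocksdb::Options make_options() {
    rocksdb::Options opts;
    opts.create_if_missing = true;

    // Pool sizes in the shared Env: the LOW-priority pool runs compactions,
    // the HIGH-priority pool runs memtable flushes.
    opts.env->SetBackgroundThreads(4, rocksdb::Env::Priority::LOW);
    opts.env->SetBackgroundThreads(2, rocksdb::Env::Priority::HIGH);

    // How many of those threads this DB instance may actually use.
    opts.max_background_compactions = 4;
    opts.max_background_flushes = 2;
    return opts;
  }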

We did some very preliminary benchmarks a couple of years ago (admittedly with
a too-small dataset), basically comparing the (at the time) stock ceph
leveldb settings against rocksdb.  At that dataset size, leveldb looked much
better for reads, but much worse for writes.

That's actually a bit troubling — many of our monitor problems have
arisen from slow reads, rather than slow writes. I suspect we want to
eliminate this before switching, if it's a concern.

...Although I think I did see a monitor caching layer go by, so maybe
it's a moot point now?

Yeah, I suspect that's helping significantly.  Based at least on what I remember seeing, though, I'm more concerned about high-latency events than average read performance.  I.e. if there is a compaction storm, which store is going to handle it more gracefully, with less spiky behavior?
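
(One way to put numbers on the "spiky behavior" question: rocksdb exposes its
stall and compaction state through GetProperty(), so a test harness could poll
it during a compaction storm.  A minimal sketch, assuming the stock RocksDB
C++ API; the property names are RocksDB's own, but the reporting function
itself is hypothetical:)

  #include <rocksdb/db.h>
  #include <iostream>
  #include <string>

  // Poll RocksDB's internal counters to see whether writes are being delayed
  // or stopped by compaction back-pressure.
  void report_stall_state(rocksdb::DB* db) {
    std::string val;
    if (db->GetProperty("rocksdb.num-immutable-mem-table", &val))
      std::cout << "immutable memtables waiting to flush: " << val << "\n";
    if (db->GetProperty("rocksdb.compaction-pending", &val))
      std::cout << "compaction pending: " << val << "\n";
    if (db->GetProperty("rocksdb.stats", &val))
      std::cout << val << "\n";  // human-readable compaction/stall summary
  }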

In those leveldb tests we only saw writes and write trims hit by those periodic 10-60 second high-latency spikes, but if I recall correctly the mon has (or at least had?) a global lock, so write stalls would basically make the whole monitor stall.  I think Joao may have improved that after we did this testing, but I don't remember the details at this point.

-Greg
