On Tue, 3 May 2016, Mark Nelson wrote:
> On 05/03/2016 11:41 AM, Gregory Farnum wrote:
> > On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
> > > On 05/02/2016 02:00 PM, Howard Chu wrote:
> > > >
> > > > Sage Weil wrote:
> > > > >
> > > > > 1) Thoughts on moving to rocksdb in general?
> > > >
> > > > Are you actually prepared to undertake all of the measurement and
> > > > tuning required to make RocksDB actually work well?  You're switching
> > > > from an (abandoned/unsupported) engine with only a handful of config
> > > > parameters to one with ~40-50 params, all of which have critical but
> > > > unpredictable impact on resource consumption and performance.
> > >
> > > You are absolutely correct, and there are definitely pitfalls we need
> > > to watch out for with the number of tunables in rocksdb.  At least on
> > > the performance side, two of the big issues we've hit with leveldb are
> > > compaction related.  In some scenarios compaction happens more slowly
> > > than writes come in, resulting in ever-growing db sizes.  The other
> > > issue is that compaction is single threaded, and this can cause stalls
> > > and general mayhem when things get really heavily loaded.  My hope is
> > > that if we do go with rocksdb, even in a sub-optimally tuned state,
> > > we'll be better off than we were with leveldb.
> > >
> > > We did some very preliminary benchmarks a couple of years ago
> > > (admittedly a too-small dataset size) basically comparing the (at the
> > > time) stock ceph leveldb settings vs rocksdb.  On this set size,
> > > leveldb looked much better for reads, but much worse for writes.
> >
> > That's actually a bit troubling -- many of our monitor problems have
> > arisen from slow reads rather than slow writes.  I suspect we want to
> > eliminate this before switching, if it's a concern.
> >
> > ...Although I think I did see a monitor caching layer go by, so maybe
> > it's a moot point now?
>
> Yeah, I suspect that's helping significantly.  I think, based at least
> on what I remember seeing, I'm more concerned about high-latency events
> than average read performance, though.  I.e., if there is a compaction
> storm, which store is going to handle it more gracefully, with less
> spiky behavior?

I'm most worried about the read storm that happens on each commit to
fetch all the just-updated PG stat keys.  The other data in the mon is
just noise in comparison, I think, with the exception of the OSDMaps...
which IIRC is what the cache you mention was for.

The initial PR, https://github.com/ceph/ceph/pull/8888, just makes the
backend choice persistent.  Rocksdb is still experimental.  There's an
accompanying ceph-qa-suite PR so that we test both.  Once we do some
performance evaluation we can decide whether the switch is safe as-is,
whether more work (a caching layer or tuning) is needed, or whether it's
a bad idea.

> In those leveldb tests we only saw writes and write trims hit by those
> periodic 10-60 second high-latency spikes, but if I recall the mon has
> (or at least had?) a global lock where write stalls would basically
> make the whole monitor stall.  I think Joao might have improved that
> after we did this testing, but I don't remember the details at this
> point.

I don't think any of this locking has changed...

sage
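
As a rough illustration of the tuning surface Howard and Mark describe
above, here is a minimal sketch using the stock rocksdb C++ API.  The
options shown are only a small subset of the ~40-50 parameters, the
values and the db path are placeholders, and none of this is what ceph
actually sets; it is just meant to show the compaction-parallelism and
write-stall knobs under discussion:

  // Illustrative only: a few of rocksdb's compaction/write-buffer knobs.
  // Values and the db path are placeholders, not ceph's settings.
  #include <rocksdb/db.h>
  #include <rocksdb/options.h>

  int main() {
    rocksdb::Options opts;
    opts.create_if_missing = true;

    // Unlike leveldb's single compaction thread, rocksdb can run
    // compactions and flushes on multiple background threads.
    opts.IncreaseParallelism(4);
    opts.max_background_compactions = 4;
    opts.max_background_flushes = 1;

    // Memtable sizing: larger/more write buffers absorb write bursts,
    // at the cost of memory and longer compactions later.
    opts.write_buffer_size = 32 << 20;   // 32 MB per memtable
    opts.max_write_buffer_number = 4;

    // Back-pressure: when L0 accumulates too many files, writes are
    // first slowed and then stopped -- the stall behavior discussed above.
    opts.level0_slowdown_writes_trigger = 20;
    opts.level0_stop_writes_trigger = 36;

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/mon-kv-test", &db);
    if (!s.ok())
      return 1;
    delete db;
    return 0;
  }

Each of these knobs interacts with the others and with the workload,
which is why getting rocksdb well tuned is an ongoing effort rather than
a one-off setting.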
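
Assuming the PR lands as described, picking the backend for a newly
created mon would presumably come down to a one-line config choice.  A
hypothetical ceph.conf fragment (the option name here is assumed, not
taken from the PR; check the PR above for the actual knob and for
whatever gating applies while rocksdb is still experimental):

  [mon]
  # hypothetical illustration: key/value backend used when a monitor's
  # store is first created; once the choice is persistent, an existing
  # mon keeps whatever backend it was created with
  mon keyvaluedb = rocksdb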