On Tue, May 3, 2016 at 10:17 AM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Tue, 3 May 2016, Mark Nelson wrote:
>> On 05/03/2016 11:41 AM, Gregory Farnum wrote:
>> > On Tue, May 3, 2016 at 6:34 AM, Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>> > > On 05/02/2016 02:00 PM, Howard Chu wrote:
>> > > > Sage Weil wrote:
>> > > > >
>> > > > > 1) Thoughts on moving to rocksdb in general?
>> > > >
>> > > > Are you actually prepared to undertake all of the measurement and
>> > > > tuning required to make RocksDB actually work well? You're
>> > > > switching from an (abandoned/unsupported) engine with only a
>> > > > handful of config parameters to one with ~40-50 params, all of
>> > > > which have critical but unpredictable impact on resource
>> > > > consumption and performance.
>> > >
>> > > You are absolutely correct, and there are definitely pitfalls we
>> > > need to watch out for with the number of tunables in rocksdb. At
>> > > least on the performance side, two of the big issues we've hit with
>> > > leveldb are compaction related. In some scenarios compaction can't
>> > > keep up with the incoming writes, resulting in ever-growing db
>> > > sizes. The other issue is that compaction is single threaded, which
>> > > can cause stalls and general mayhem when things get really heavily
>> > > loaded. My hope is that if we do go with rocksdb, even in a
>> > > sub-optimally tuned state, we'll be better off than we were with
>> > > leveldb.
>> > >
>> > > We did some very preliminary benchmarks a couple of years ago
>> > > (admittedly with a too-small dataset size) basically comparing the
>> > > (at the time) stock ceph leveldb settings vs rocksdb. At that
>> > > dataset size, leveldb looked much better for reads, but much worse
>> > > for writes.
>> >
>> > That's actually a bit troubling -- many of our monitor problems have
>> > arisen from slow reads rather than slow writes. I suspect we want to
>> > eliminate this before switching, if it's a concern.
>> >
>> > ...Although I think I did see a monitor caching layer go by, so maybe
>> > it's a moot point now?
>>
>> Yeah, I suspect that's helping significantly. Based at least on what I
>> remember seeing, though, I'm more concerned about high-latency events
>> than average read performance. I.e., if there is a compaction storm,
>> which store is going to handle it more gracefully, with less spiky
>> behavior?
>
> I'm most worried about the read storm that happens on each commit to
> fetch all the just-updated PG stat keys. The other data in the mon is
> just noise in comparison, I think, with the exception of the OSDMaps...
> which IIRC is what the cache you mention was for.
>
> The initial PR,
>
>     https://github.com/ceph/ceph/pull/8888
>
> just makes the backend choice persistent. Rocksdb is still
> experimental. There's an accompanying ceph-qa-suite PR so that we test
> both. Once we do some performance evaluation we can decide whether the
> switch is safe as-is, whether more work (a caching layer or tuning) is
> needed, or whether it's a bad idea.
>
>> In those leveldb tests we only saw writes and write trims hit by those
>> periodic 10-60 second high-latency spikes, but if I recall the mon has
>> (or at least had?) a global lock where write stalls would basically
>> make the whole monitor stall. I think Joao might have improved that
>> after we did this testing, but I don't remember the details at this
>> point.
>
> I don't think any of this locking has changed...
The paxos state machine is no longer blocked for reads while an
unrelated write is happening. Nor are older-version reads on the
writing subsystem. That fix is post-firefly, right?
-Greg
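
For context on the tunables under discussion: below is a minimal sketch
of what the compaction-related knobs look like through RocksDB's C++
Options API. The specific values are illustrative assumptions only, not
recommendations for the mon store; choosing real values is exactly the
performance evaluation work Sage describes above.

    // Sketch: opening a RocksDB instance with the compaction-related
    // knobs discussed in the thread. Values are illustrative, untested
    // assumptions, not tuned settings for a Ceph monitor store.
    #include <cassert>
    #include "rocksdb/db.h"
    #include "rocksdb/options.h"

    int main() {
      rocksdb::Options options;
      options.create_if_missing = true;

      // Unlike leveldb's single compaction thread, rocksdb can run
      // several background compactions; IncreaseParallelism() sizes
      // the background thread pools accordingly.
      options.IncreaseParallelism(4);
      options.max_background_compactions = 4;

      // Write-stall behavior is explicit: slow writers down once this
      // many level-0 files accumulate, and stop them entirely at the
      // higher trigger. These thresholds are where "compaction storm"
      // latency spikes would surface.
      options.level0_slowdown_writes_trigger = 8;
      options.level0_stop_writes_trigger = 12;

      // Memtable sizing controls how much data accumulates in memory
      // before a flush to level 0.
      options.write_buffer_size = 32 * 1024 * 1024;  // 32 MB
      options.max_write_buffer_number = 4;

      rocksdb::DB* db = nullptr;
      rocksdb::Status s =
          rocksdb::DB::Open(options, "/tmp/mon-store-test", &db);
      assert(s.ok());

      delete db;
      return 0;
    }

Each field maps onto a failure mode mentioned in the thread:
IncreaseParallelism() addresses leveldb's single compaction thread, the
level-0 triggers govern when write stalls kick in, and the memtable
settings bound how far writes can run ahead of compaction.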