Hi All,
With Igor's patch last week I was able to get some bluestore performance
runs in without segfaulting and started looking into the results.
Somewhere along the line we really screwed up read performance, but
that's another topic. Right now I want to focus on random writes.
Before we put the onode on a diet we were seeing massive amounts of read
traffic in RocksDB during compaction that caused write stalls during 4K
random writes. Random write performance on fast hardware like NVMe
devices was often below filestore at anything other than very large IO
sizes. This was largely due to the size of the onode compounded with
RocksDB's tendency toward read and write amplification.
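As a rough reminder of why that hurt, here's a back-of-envelope sketch of
the amplification on a single 4K write. Every constant in it (onode size,
RocksDB write amplification factor) is a made-up placeholder for
illustration, not a number from these runs:

#include <cstdio>

int main() {
  // All numbers here are hypothetical placeholders, not measurements.
  const double client_bytes      = 4096.0;      // one 4K client write
  const double onode_bytes       = 10.0 * 1024; // assumed old (pre-diet) onode size
  const double rocksdb_write_amp = 10.0;        // assumed LSM compaction amplification

  // Each 4K write rewrites the whole onode into the KV store, and
  // compaction later reads and rewrites that data again across levels.
  const double kv_bytes  = onode_bytes * rocksdb_write_amp;
  const double total_amp = (client_bytes + kv_bytes) / client_bytes;

  std::printf("KV bytes per 4K client write: %.0f\n", kv_bytes);
  std::printf("overall write amplification:  ~%.1fx\n", total_amp);
  return 0;
}

With numbers anywhere in that ballpark the KV traffic, not the client
data, ends up dominating the device.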
The new test results look very promising. We've dramatically improved
random write performance at most IO sizes, and it's now typically quite
a bit higher than both filestore and the older bluestore code.
Unfortunately, for very small IO sizes performance hasn't improved much.
We are no longer seeing huge amounts of RocksDB read traffic, and we're
seeing fewer write stalls. We are, however, seeing huge memory usage
(~9GB RSS per OSD) and very high CPU usage. I think this confirms some
of the memory issues Somnath was continuing to see. Based on how the
OSDs were behaving I don't think it's exactly a leak, but we still need
to run it through massif to be sure.
I ended up spending some time tonight with perf and digging through the
encode code. I wrote up some notes with graphs and code snippets and
decided to put them up on the web. Basically, some of the encoding
changes we implemented last month to reduce the onode size also appear
to result in more buffer::list appends and the associated overhead.
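To make the append pattern concrete, here's a minimal standalone sketch
of the two shapes the encode path can take: one append per small field
versus staging fields in a local scratch buffer and appending once per
onode. The Buffer type and the varint encoding below are just stand-ins
for illustration, not our actual buffer::list or encode helpers:

#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for buffer::list; only the append pattern matters here.
struct Buffer {
  std::string data;
  void append(const char* p, size_t n) { data.append(p, n); }
};

// LEB128-style varint of one integer, used by both paths below.
static size_t put_varint(uint64_t v, char* out) {
  size_t n = 0;
  do {
    out[n++] = char((v & 0x7f) | (v >= 0x80 ? 0x80 : 0));
    v >>= 7;
  } while (v);
  return n;
}

// Shape 1: encode each small field straight into the bufferlist.
// The per-append bookkeeping is paid once per field.
static void encode_per_field(const std::vector<uint64_t>& fields, Buffer& bl) {
  for (uint64_t v : fields) {
    char tmp[10];
    bl.append(tmp, put_varint(v, tmp));
  }
}

// Shape 2: stage everything in a local scratch buffer, then append once.
// A little extra copying, but only one append per onode.
static void encode_batched(const std::vector<uint64_t>& fields, Buffer& bl) {
  std::string scratch;
  for (uint64_t v : fields) {
    char tmp[10];
    scratch.append(tmp, put_varint(v, tmp));
  }
  bl.append(scratch.data(), scratch.size());
}

int main() {
  std::vector<uint64_t> fields = {1, 300, 70000, 1ull << 40};
  Buffer a, b;
  encode_per_field(fields, a);  // many small appends
  encode_batched(fields, b);    // one append
  return a.data != b.data;      // same bytes either way
}

Nothing fancy, but it shows where the extra overhead comes from when the
encode path goes field-by-field.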
I've been trying to think through ways to improve the situation and
thought other people might have some ideas too. Here's a link to the
short writeup:
https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
Thanks,
Mark