bluestore onode diet and encoding overhead

Hi All,

With Igor's patch last week I was able to get some bluestore performance runs in without segfaulting and started looking into the results. Somewhere along the line we really screwed up read performance, but that's another topic. Right now I want to focus on random writes. Before we put the onode on a diet, we were seeing massive amounts of read traffic in RocksDB during compaction, which caused write stalls during 4K random writes. Random write performance on fast hardware like NVMe devices was often below filestore at anything other than very large IO sizes. This was largely due to the size of the onode, compounded by RocksDB's tendency toward read and write amplification.

The new test results look very promising. We've dramatically improved performance of random writes at most IO sizes, so that they are now typically quite a bit higher than both filestore and older bluestore code. Unfortunately, for very small IO sizes performance hasn't improved much. We are no longer seeing huge amounts of RocksDB read traffic, and we are seeing fewer write stalls. We are, however, seeing huge memory usage (~9GB RSS per OSD) and very high CPU usage. I think this confirms some of the memory issues Somnath was continuing to see. Based on how the OSDs were behaving I don't think it's exactly a leak, but we still need to run it through massif to be sure.

I ended up spending some time tonight with perf, digging through the encode code. I wrote up some notes with graphs and code snippets and decided to put them up on the web. Basically, some of the encoding changes we implemented last month to reduce the onode size also appear to result in more buffer::list appends and the overhead that comes with them. I've been trying to think through ways to improve the situation and thought other people might have some ideas too. Here's a link to the short writeup:

https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
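To give a rough idea of the kind of append overhead I mean, here is a minimal sketch. It is not the actual bluestore encode code. CountingBuffer is a hypothetical stand-in (not buffer::list) that only counts append calls, and the 7-bits-per-byte varint format is assumed purely for illustration. Both functions produce the same bytes, but one appends byte by byte and the other stages the bytes locally and appends once:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for an append-only buffer; NOT Ceph's buffer::list.
// It stores the bytes and counts how many times append() is called.
struct CountingBuffer {
  std::string data;
  std::size_t append_calls = 0;

  void append(const char* p, std::size_t len) {
    ++append_calls;
    data.append(p, len);
  }
};

// Varint encoding (7 payload bits per byte, high bit set on all but the
// last byte), issuing one append() per output byte.
inline void encode_varint_per_byte(std::uint64_t v, CountingBuffer& out) {
  while (v >= 0x80) {
    char b = static_cast<char>((v & 0x7f) | 0x80);
    out.append(&b, 1);
    v >>= 7;
  }
  char b = static_cast<char>(v);
  out.append(&b, 1);
}

// Same encoding, but staged in a small local array and appended once.
inline void encode_varint_batched(std::uint64_t v, CountingBuffer& out) {
  char tmp[10];  // a 64-bit value needs at most 10 varint bytes
  std::size_t n = 0;
  while (v >= 0x80) {
    tmp[n++] = static_cast<char>((v & 0x7f) | 0x80);
    v >>= 7;
  }
  tmp[n++] = static_cast<char>(v);
  out.append(tmp, n);
}
```

With a real buffer each append carries its own bookkeeping cost, so if per-append cost is what dominates, the batched form should be cheaper for the same encoded size. That is an assumption worth checking against the perf data rather than a measured result.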

Thanks,
Mark