Good analysis. My original comments about putting the oNode on a diet included the idea of a "custom" encode/decode path for certain high-usage cases. At the time, Sage resisted going down that path, hoping that a more optimized generic case would get the job done. Your analysis shows that while we've achieved a significant space reduction, it has come at the expense of CPU time -- which dominates small-object performance. (I suspect we'd eventually discover that the variable-length decode path is also responsible for a substantial read performance degradation -- which may or may not be part of the read performance drop-off that you're seeing.) This isn't a surprising result, though it is unfortunate.

I believe we need to revisit the idea of custom encode/decode paths for high-usage cases, only now the gains need to be focused on CPU utilization as well as space efficiency; there's a rough sketch of what I mean below the quoted message. I believe this activity can also address some of the memory consumption issues that we're seeing now.

I believe that the current lextent/blob/pextent usage of standard STL maps is both space and time inefficient -- in a place where it matters a lot. Sage has already discussed using something like flat_map from the Boost library as a way to reduce the memory overhead, etc. I believe this is the right direction (second sketch below). Where are we on getting Boost into our build?

Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> Sent: Tuesday, July 12, 2016 12:03 AM
> To: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: bluestore onode diet and encoding overhead
>
> Hi All,
>
> With Igor's patch last week I was able to get some bluestore performance
> runs in without segfaulting and started looking into the results.
> Somewhere along the line we really screwed up read performance, but
> that's another topic. Right now I want to focus on random writes.
> Before we put the onode on a diet we were seeing massive amounts of read
> traffic in RocksDB during compaction that caused write stalls during 4K
> random writes. Random write performance on fast hardware like NVMe
> devices was often below filestore at anything other than very large IO sizes.
> This was largely due to the size of the onode compounded with RocksDB's
> tendency toward read and write amplification.
>
> The new test results look very promising. We've dramatically improved
> performance of random writes at most IO sizes, so that they are now
> typically quite a bit higher than both filestore and the older bluestore code.
> Unfortunately, for very small IO sizes performance hasn't improved much.
> We are no longer seeing huge amounts of RocksDB read traffic, and there
> are fewer write stalls. We are, however, seeing huge memory usage (~9GB
> RSS per OSD) and very high CPU usage. I think this confirms some of the
> memory issues Somnath was continuing to see. I don't think it's a leak
> exactly, based on how the OSDs were behaving, but we still need to run it
> through massif to be sure.
>
> I ended up spending some time tonight with perf and digging through the
> encode code. I wrote up some notes with graphs and code snippets and
> decided to put them up on the web. Basically, some of the encoding changes
> we implemented last month to reduce the onode size also appear to result in
> more buffer::list appends and the associated overhead.
> I've been trying to think through ways to improve the situation and thought
> other people might have some ideas too. Here's a link to the short writeup:
>
> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
>
> Thanks,
> Mark
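
Coming back to the custom encode/decode idea above, here is a minimal sketch of the trade-off I'm describing. The types and layout are made up for illustration -- plain C++ against a std::vector rather than the real onode or ceph::buffer::list code -- but it contrasts a generic per-field varint encode (one small append per field, roughly the pattern behind the buffer::list append overhead Mark measured) with a custom fixed-layout path that does a single bulk copy per record:

// Illustrative only: hypothetical "ToyExtent", plain C++, not the actual
// BlueStore encode code or ceph::buffer::list API.
#include <cstdint>
#include <cstring>
#include <vector>

struct ToyExtent {
  uint64_t offset;
  uint64_t length;
  uint32_t blob_id;
};

// (a) Generic path: varint-encode each field and append it separately.
//     Space-efficient, but pays a branchy loop and a small append per field.
static void append_varint(std::vector<uint8_t>& out, uint64_t v) {
  while (v >= 0x80) {
    out.push_back(static_cast<uint8_t>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  out.push_back(static_cast<uint8_t>(v));
}

void encode_generic(const ToyExtent& e, std::vector<uint8_t>& out) {
  append_varint(out, e.offset);
  append_varint(out, e.length);
  append_varint(out, e.blob_id);
}

// (b) Custom path: fixed layout with a size known up front, one bulk copy,
//     no per-field branching, and a trivially cheap decode.
//     (Real code would force a little-endian layout; omitted here.)
void encode_custom(const ToyExtent& e, std::vector<uint8_t>& out) {
  uint8_t buf[sizeof(uint64_t) * 2 + sizeof(uint32_t)];
  std::memcpy(buf, &e.offset, sizeof(e.offset));
  std::memcpy(buf + 8, &e.length, sizeof(e.length));
  std::memcpy(buf + 16, &e.blob_id, sizeof(e.blob_id));
  out.insert(out.end(), buf, buf + sizeof(buf));
}

The custom path gives back some of the space savings, but the CPU cost per record drops to essentially a memcpy on both encode and decode -- and for small objects that CPU cost is what dominates.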
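
On the lextent/blob/pextent map question, here's the kind of swap I mean -- again just a sketch with hypothetical types, not the actual BlueStore structures:

// Illustrative only: hypothetical types standing in for an offset-sorted
// extent map; not the actual BlueStore code.
#include <cstdint>
#include <boost/container/flat_map.hpp>

struct ToyLExtent {
  uint64_t length;
  uint32_t blob_id;
};

// flat_map keeps its key/value pairs in one sorted, contiguous vector:
// no per-node allocation, no parent/child/color pointers, and lookups walk
// a cache-friendly array instead of chasing rb-tree nodes.
using ToyExtentMap = boost::container::flat_map<uint64_t, ToyLExtent>;

ToyExtentMap build_map() {
  ToyExtentMap m;
  m.reserve(4);                      // one allocation up front
  m.emplace(0u,     ToyLExtent{4096, 1});
  m.emplace(4096u,  ToyLExtent{4096, 1});
  m.emplace(8192u,  ToyLExtent{8192, 2});
  m.emplace(65536u, ToyLExtent{4096, 3});
  return m;
}

// Typical lookup: which extent (if any) covers a logical offset?
const ToyLExtent* find_extent(const ToyExtentMap& m, uint64_t off) {
  auto it = m.upper_bound(off);
  if (it == m.begin())
    return nullptr;
  --it;
  return (off < it->first + it->second.length) ? &it->second : nullptr;
}

The win is memory and cache behavior: one contiguous allocation instead of a separate allocation plus a few tens of bytes of rb-tree node overhead per extent. The cost is O(n) insertion in the middle, which should be acceptable for small per-onode maps and is irrelevant when the map is rebuilt in sorted order at decode time.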