RE: bluestore onode diet and encoding overhead


 



Great papers!

Your profiling pretty much shows that the problem is really the buffer::list stuff and not the encoding itself (at least not yet!).

Yes, it's relatively easy to fix the buffer encoding. You just have to over-allocate (do a worst-case size computation for the encoded data), encode into the over-allocated chunk, and then free up the unused portion.
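
Roughly, the pattern is something like this -- just a sketch, with std::string standing in for bufferlist and worst_case_size()/encode_varint() as made-up helpers, not the real API:

#include <cstdint>
#include <string>
#include <vector>

struct Extent { uint64_t offset; uint64_t length; };

// Worst case: every 64-bit varint field takes its maximum width (10 bytes).
static size_t worst_case_size(const std::vector<Extent>& extents) {
  return extents.size() * 2 * 10;
}

// LEB128-style varint: 7 bits per byte, high bit set while more bytes follow.
static size_t encode_varint(uint64_t v, uint8_t* p) {
  size_t n = 0;
  do {
    uint8_t b = v & 0x7f;
    v >>= 7;
    p[n++] = b | (v ? 0x80 : 0);
  } while (v);
  return n;
}

static std::string encode_extents(const std::vector<Extent>& extents) {
  std::string buf;
  buf.resize(worst_case_size(extents));            // over-allocate once, up front
  uint8_t* p = reinterpret_cast<uint8_t*>(&buf[0]);
  size_t used = 0;
  for (const auto& e : extents) {                  // encode straight into the chunk
    used += encode_varint(e.offset, p + used);
    used += encode_varint(e.length, p + used);
  }
  buf.resize(used);                                // give back the unused tail
  return buf;
}

That way there's one allocation and one trim per encode instead of per-field buffer bookkeeping.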


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx


> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> Sent: Tuesday, July 12, 2016 8:38 AM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph-devel <ceph-
> devel@xxxxxxxxxxxxxxx>
> Subject: Re: bluestore onode diet and encoding overhead
> 
> 
> 
> On 07/12/2016 10:20 AM, Allen Samuels wrote:
> > Good analysis.
> >
> > My original comments about putting the oNode on a diet included the idea
> > of a "custom" encode/decode path for certain high-usage cases. At the time,
> > Sage resisted going down that path hoping that a more optimized generic
> > case would get the job done. Your analysis shows that while we've achieved
> > significant space reduction this has come at the expense of CPU time -- which
> > dominates small object performance (I suspect that eventually we'd discover
> > that the variable length decode path would be responsible for a substantial
> > read performance degradation also -- which may or may not be part of the
> > read performance drop-off that you're seeing). This isn't a surprising result,
> > though it is unfortunate.
> >
> > I believe we need to revisit the idea of custom encode/decode paths for
> > high-usage cases, only now the gains need to be focused on CPU utilization
> > as well as space efficiency.
> 
> I'm not against it, but it might be worth at least a quick attempt at
> preallocating the append_buffer and/or Piotr's idea to directly memcpy
> without doing the append at all.  That may help quite a bit (though
> perhaps it's not enough in the long run).
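
A rough illustration of the contrast (std::string standing in for bufferlist here; not the actual append_buffer code):

#include <cstdint>
#include <cstring>
#include <string>

// Naive path: one small append per field, each with its own length checks
// and possible reallocation of the underlying buffer.
static void encode_appends(std::string& bl, const uint64_t* vals, size_t n) {
  for (size_t i = 0; i < n; ++i)
    bl.append(reinterpret_cast<const char*>(&vals[i]), sizeof(uint64_t));
}

// Preallocated path: size the buffer once, then memcpy fields straight in.
static void encode_prealloc(std::string& bl, const uint64_t* vals, size_t n) {
  size_t off = bl.size();
  bl.resize(off + n * sizeof(uint64_t));           // single allocation
  char* p = &bl[off];
  for (size_t i = 0; i < n; ++i)
    std::memcpy(p + i * sizeof(uint64_t), &vals[i], sizeof(uint64_t));
}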
> 
> A couple of other thoughts:
> 
> I still think SIMD encode approaches are interesting if we can lay data out in
> memory in a friendly way (this feels like it might be painful though):
> 
> http://arxiv.org/abs/1209.2137
> 
> But on the other hand, Kenton Varda, who was previously a primary author
> of Google's protocol buffers, ended up doing something a little different
> from varint:
> 
> https://capnproto.org/encoding.html
> 
> Look specifically at the packing section.  It looks somewhat attractive to me.
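
The packing trick there is roughly this (simplified -- the real format also has run-length special cases for the 0x00 and 0xff tag bytes, omitted here):

#include <cstdint>
#include <string>

// Each 8-byte word is prefixed with a tag byte whose bits mark which bytes
// are non-zero; only those bytes are emitted, so zero bytes cost nothing.
static std::string pack_words(const uint8_t* data, size_t nwords) {
  std::string out;
  for (size_t w = 0; w < nwords; ++w) {
    const uint8_t* word = data + w * 8;
    uint8_t tag = 0;
    std::string body;
    for (int i = 0; i < 8; ++i) {
      if (word[i] != 0) {
        tag |= uint8_t(1u << i);      // bit i set => byte i present
        body.push_back(char(word[i]));
      }
    }
    out.push_back(char(tag));
    out += body;
  }
  return out;
}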
> 
> Mark
> 
> >
> > I believe this activity can also address some of the memory consumption
> > issues that we're seeing now. I believe that the current lextent/blob/pextent
> > usage of standard STL maps is both space and time inefficient -- in a place
> > where it matters a lot. Sage has already discussed usage of something like
> > flat_map from the boost library as a way to reduce the memory overhead,
> > etc. I believe this is the right direction.
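
Concretely, something along these lines -- the lextent type below is just a stand-in, but the container is boost::container::flat_map, which keeps entries in one sorted, contiguous vector instead of per-node allocations:

#include <cstdint>
#include <boost/container/flat_map.hpp>

// Stand-in value type for illustration; not the actual bluestore struct.
struct lextent_t { uint64_t blob_id; uint64_t blob_offset; uint64_t length; };

// Keyed by logical offset, same lookup interface as std::map.
using lextent_map_t = boost::container::flat_map<uint64_t, lextent_t>;

int main() {
  lextent_map_t m;
  m.reserve(16);                        // one contiguous allocation up front
  m[0]    = lextent_t{1, 0, 4096};
  m[4096] = lextent_t{2, 0, 4096};
  auto it = m.lower_bound(4096);        // binary search over the vector
  return (it != m.end() && it->second.blob_id == 2) ? 0 : 1;
}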
> >
> > Where are we on getting boost into our build?
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, Milpitas, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx
> >
> >
> >> -----Original Message-----
> >> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
> >> owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
> >> Sent: Tuesday, July 12, 2016 12:03 AM
> >> To: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> >> Subject: bluestore onode diet and encoding overhead
> >>
> >> Hi All,
> >>
> >> With Igor's patch last week I was able to get some bluestore performance
> >> runs in without segfaulting and started looking into the results.
> >> Somewhere along the line we really screwed up read performance, but
> >> that's another topic.  Right now I want to focus on random writes.
> >> Before we put the onode on a diet we were seeing massive amounts of read
> >> traffic in RocksDB during compaction that caused write stalls during 4K
> >> random writes.  Random write performance on fast hardware like NVMe
> >> devices was often below filestore at anything other than very large IO sizes.
> >> This was largely due to the size of the onode compounded with RocksDB's
> >> tendency toward read and write amplification.
> >>
> >> The new test results look very promising.  We've dramatically improved
> >> performance of random writes at most IO sizes, so that they are now
> >> typically quite a bit higher than both filestore and older bluestore code.
> >> Unfortunately, for very small IO sizes performance hasn't improved much.
> >> We are no longer seeing huge amounts of RocksDB read traffic, and we see
> >> fewer write stalls.  We are however seeing huge memory usage (~9GB RSS per
> >> OSD) and very high CPU usage.  I think this confirms some of the memory
> >> issues Somnath was continuing to see.  I don't think it's a leak exactly, based
> >> on how the OSDs were behaving, but we need to run through massif still to
> >> be sure.
> >>
> >> I ended up spending some time tonight with perf and digging through the
> >> encode code.  I wrote up some notes with graphs and code snippets and
> >> decided to put them up on the web.  Basically some of the encoding changes
> >> we implemented last month to reduce the onode size also appear to result in
> >> more buffer::list appends and the associated overhead.
> >> I've been trying to think through ways to improve the situation and thought
> >> other people might have some ideas too.  Here's a link to the short writeup:
> >>
> >> https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing
> >>
> >> Thanks,
> >> Mark



