Re: bluestore onode diet and encoding overhead

On 07/12/2016 04:15 PM, Allen Samuels wrote:
Great papers!

Both are backed by open source code on github, which was some of my motivation for looking at them. The SIMD encoding paper only deals with 32-bit ints afaik, but Cap'n Proto looks pretty robust/convenient out of the box.


Your profiling pretty much shows that the problem is really the buffer::list stuff and not the encoding itself (at least not yet!)

Yes, it's relatively easy to fix the buffer encoding. You just have to over-allocate (do a worst-case size computation for the data), encode into the over-allocated chunk, and then free up the unused portion.
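
Something like this, roughly (just a sketch with made-up names and a standalone varint helper, not the actual bufferlist/bluestore code):

    #include <cstdint>
    #include <vector>

    struct Extent { uint64_t offset; uint64_t length; };

    // Worst case: a u64 varint never takes more than 10 bytes.
    static size_t worst_case_size(const std::vector<Extent>& v) {
      return v.size() * 2 * 10;
    }

    // Little-endian base-128 varint; returns bytes written.
    static size_t put_varint(uint8_t* p, uint64_t v) {
      size_t n = 0;
      do {
        uint8_t b = v & 0x7f;
        v >>= 7;
        p[n++] = b | (v ? 0x80 : 0);
      } while (v);
      return n;
    }

    std::vector<uint8_t> encode_extents(const std::vector<Extent>& v) {
      std::vector<uint8_t> buf(worst_case_size(v));  // over-allocate once, worst case
      uint8_t* p = buf.data();
      for (const auto& e : v) {
        p += put_varint(p, e.offset);
        p += put_varint(p, e.length);
      }
      buf.resize(p - buf.data());                    // give back the unused tail
      return buf;
    }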


Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx


-----Original Message-----
From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
Sent: Tuesday, July 12, 2016 8:38 AM
To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
Subject: Re: bluestore onode diet and encoding overhead



On 07/12/2016 10:20 AM, Allen Samuels wrote:
Good analysis.

My original comments about putting the oNode on a diet included the idea
of a "custom" encode/decode path for certain high-usage cases. At the time,
Sage resisted going down that path hoping that a more optimized generic
case would get the job done. Your analysis shows that while we've achieved
significant space reduction, this has come at the expense of CPU time -- which
dominates small object performance (I suspect that eventually we'd discover
that the variable length decode path would be responsible for a substantial
read performance degradation also -- which may or may not be part of the
read performance drop-off that you're seeing). This isn't a surprising result,
though it is unfortunate.
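
To make the decode-side concern concrete: a generic variable-length decode is a data-dependent, byte-at-a-time loop, roughly like the sketch below (it's not our actual decoder, just the shape of it):

    #include <cstddef>
    #include <cstdint>

    // Generic base-128 varint decode.  The loop length depends on the data,
    // so every field costs an unpredictable number of branches, which is the
    // part that tends to hurt on small onodes.
    size_t get_varint(const uint8_t* p, size_t len, uint64_t* out) {
      uint64_t v = 0;
      unsigned shift = 0;
      for (size_t i = 0; i < len && shift < 64; ++i) {
        v |= uint64_t(p[i] & 0x7f) << shift;
        if (!(p[i] & 0x80)) {      // high bit clear: last byte of this field
          *out = v;
          return i + 1;            // bytes consumed
        }
        shift += 7;
      }
      return 0;                    // truncated or overlong input
    }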

I believe we need to revisit the idea of custom encode/decode paths for
high-usage cases, only now the gains need to be focused on CPU utilization
as well as space efficiency.

I'm not against it, but it might be worth at least a quick attempt at
preallocating the append_buffer and/or Piotr's idea to directly memcpy
without doing the append at all.  That may help quite a bit (though
perhaps it's not enough in the long run).
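
The shape I have in mind is roughly this (made-up types, not the real bufferlist internals):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // One contiguous chunk plus a cursor: fields get memcpy'd straight in and
    // the final length is accounted for once at the end, instead of paying
    // append() bookkeeping per field.
    struct RawChunk {
      std::vector<uint8_t> data;
      size_t used = 0;
      explicit RawChunk(size_t worst_case) : data(worst_case) {}
      uint8_t* cursor() { return data.data() + used; }
      void advance(size_t n) { used += n; }
    };

    // Host-endian for brevity.
    inline void put_u32(RawChunk& c, uint32_t v) {
      std::memcpy(c.cursor(), &v, sizeof(v));  // single memcpy, no realloc/bounds dance
      c.advance(sizeof(v));
    }

    // Usage: RawChunk c(worst_case); put_u32(c, flags); put_u32(c, nblobs); ...
    // then hand (c.data.data(), c.used) to the bufferlist in one shot at the end.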

A couple of other thoughts:

I still think SIMD encode approaches are interesting if we can lay data out in
memory in a friendly way (though this feels like it might be painful):

http://arxiv.org/abs/1209.2137

But on the other hand, Kenton Varda, who was previously a primary author
of Google's protocol buffers, ended up doing something a little different
from varint:

https://capnproto.org/encoding.html

Look specifically at the packing section.  It looks somewhat attractive to me.
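
If I read that page right, the core of the packing scheme is roughly the sketch below (the real format also run-length-encodes the all-zero and all-0xff tag cases, which I've left out to keep it short):

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Treat the message as 8-byte words; for each word emit one tag byte
    // whose bits say which bytes are nonzero, then only the nonzero bytes.
    std::vector<uint8_t> pack_words(const uint64_t* words, size_t n) {
      std::vector<uint8_t> out;
      out.reserve(n * 9);                    // worst case: tag + 8 bytes per word
      for (size_t w = 0; w < n; ++w) {
        const uint8_t* b = reinterpret_cast<const uint8_t*>(&words[w]);
        uint8_t tag = 0;
        for (int i = 0; i < 8; ++i)
          if (b[i]) tag |= uint8_t(1u << i);
        out.push_back(tag);
        for (int i = 0; i < 8; ++i)
          if (b[i]) out.push_back(b[i]);
      }
      return out;
    }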

Mark


I believe this activity can also address some of the memory consumption
issues that we're seeing now. I believe that the current lextent/blob/pextent
usage of standard STL maps is both space and time inefficient -- in a place
where it matters a lot. Sage has already discussed usage of something like
flat_map from the boost library as a way to reduce the memory overhead,
etc. I believe this is the right direction.
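
For reference, boost::container::flat_map is more or less a drop-in for the std::map operations we use; a tiny sketch with a made-up lextent type (the real one lives in bluestore):

    #include <boost/container/flat_map.hpp>
    #include <cstdint>

    // Stand-in value type just for the example.
    struct lextent_stub { uint64_t blob_id; uint32_t blob_offset; uint32_t length; };

    // Sorted-vector map: one contiguous allocation, no per-node malloc overhead,
    // cache-friendly lookup/iteration.  Inserts are O(n), but these maps are
    // small and mostly rebuilt once per decode, so that trade-off should be fine.
    using lextent_map_t = boost::container::flat_map<uint64_t, lextent_stub>;

    void example() {
      lextent_map_t m;
      m.reserve(16);                      // can preallocate, unlike std::map
      m[0]    = lextent_stub{1, 0, 4096};
      m[4096] = lextent_stub{2, 0, 4096};
      auto it = m.lower_bound(4096);      // same interface as std::map
      (void)it;
    }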

Where are we on getting boost into our build?

Allen Samuels
SanDisk | a Western Digital brand
2880 Junction Avenue, Milpitas, CA 95134
T: +1 408 801 7030 | M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx


-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Mark Nelson
Sent: Tuesday, July 12, 2016 12:03 AM
To: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
Subject: bluestore onode diet and encoding overhead

Hi All,

With Igor's patch last week I was able to get some bluestore performance runs in without segfaulting and started looking into the results.  Somewhere along the line we really screwed up read performance, but that's another topic.  Right now I want to focus on random writes.

Before we put the onode on a diet we were seeing massive amounts of read traffic in RocksDB during compaction that caused write stalls during 4K random writes.  Random write performance on fast hardware like NVMe devices was often below filestore at anything other than very large IO sizes.  This was largely due to the size of the onode compounded with RocksDB's tendency toward read and write amplification.

The new test results look very promising.  We've dramatically improved performance of random writes at most IO sizes, so that they are now typically quite a bit higher than both filestore and the older bluestore code.  Unfortunately, for very small IO sizes performance hasn't improved much.  We are no longer seeing huge amounts of RocksDB read traffic, and write stalls are less frequent.  We are, however, seeing huge memory usage (~9GB RSS per OSD) and very high CPU usage.  I think this confirms some of the memory issues Somnath was continuing to see.  I don't think it's a leak exactly, based on how the OSDs were behaving, but we still need to run through massif to be sure.

I ended up spending some time tonight with perf and digging through the encode code.  I wrote up some notes with graphs and code snippets and decided to put them up on the web.  Basically some of the encoding changes we implemented last month to reduce the onode size also appear to result in more buffer::list appends and the associated overhead.  I've been trying to think through ways to improve the situation and thought other people might have some ideas too.  Here's a link to the short writeup:


https://drive.google.com/file/d/0B2gTBZrkrnpZeC04eklmM2I4Wkk/view?usp=sharing

Thanks,
Mark



