Re: bufferlist appenders

On 08/13/2016 04:31 PM, Sage Weil wrote:
On Fri, 12 Aug 2016, Allen Samuels wrote:

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-
owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
Sent: Friday, August 12, 2016 7:27 AM
To: ceph-devel@xxxxxxxxxxxxxxx
Subject: bufferlist appenders

A ton of time in the encoding/marshalling is spent doing bufferlist appends.
This is partly because the buffer code is doing lots of sanity range checks, and
partly because there are multiple layers that do range checks and length
updates (bufferlist _len changes, and bufferlist::append_buffer (a ptr) gets
its length updated, at the very least).
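
For context, the hot path bottoms out in bufferlist::append for every encoded
field, and each of those calls pays the checks above.  Schematically (a sketch
of the usage pattern, not the actual encode.h internals):

 bufferlist bl;
 uint64_t foo = 42;
 ::encode(foo, bl);  // ends up in bl.append((char*)&foo, sizeof(foo)),
                     // which range-checks and updates both _len and the
                     // append_buffer's length on every call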

To simplify and speed this up, I propose an 'appender' concept/type that is
used for doing appends in a more efficient way.  It would be used like so:

 bufferlist bl;
 {
   bufferlist::safe_appender a = bl.get_safe_appender();
   ::encode(foo, a);
 }

or

 {
   bufferlist::unsafe_appender a = bl.get_unsafe_appender(1024);
   ::encode(foo, a);
 }

The appender keeps its own bufferptr that it copies data into.  The bufferptr
isn't given to the bufferlist until the appender is destroyed (or flush() is called
explicitly).  This means that appends are generally just a memcpy and a
position pointer addition.  In the safe_appender case, we also do a range
check and allocate a new buffer if necessary.  In the unsafe_appender case,
it is the caller's responsibility to say how big a buffer to preallocate.
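
Roughly, the shape is something like this (a hand-wavy sketch to show the
idea; the member names and details here are made up, the real code is in the
PR below):

 class safe_appender {
   bufferlist *bl;   // list that eventually receives the buffer
   bufferptr bp;     // private buffer we copy into
   size_t pos = 0;   // current write offset within bp
 public:
   explicit safe_appender(bufferlist *l, size_t prealloc)
     : bl(l), bp(buffer::create(prealloc)) {}
   ~safe_appender() { flush(); }
   void flush() {
     if (pos) {
       bp.set_length(pos);
       bl->append(bp);    // hand the filled ptr to the list
       bp = bufferptr();  // drop our ref; the next append reallocates
       pos = 0;
     }
   }
   void append(const char *p, size_t len) {
     if (pos + len > bp.length()) {  // the range check unsafe_appender skips
       flush();
       bp = buffer::create(std::max(len, (size_t)4096));
     }
     memcpy(bp.c_str() + pos, p, len);  // usually just a memcpy + offset bump
     pos += len;
   }
 };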

I have a simple prototype here:

	https://github.com/ceph/ceph/pull/10700

It appears to be almost 10x faster when encoding a uint64_t in a loop!

[ RUN      ] BufferList.appender_bench
appending 1073741824 bytes
buffer::list::append 20.285963
buffer::list encode 19.719120
buffer::list::safe_appender::append 2.588926
buffer::list::safe_appender::append_v 2.837026
buffer::list::safe_appender encode 3.000614
buffer::list::unsafe_appender::append 2.452116
buffer::list::unsafe_appender::append_v 2.553745
buffer::list::unsafe_appender encode 2.200110
[       OK ] BufferList.appender_bench (55637 ms)

Interesting, unsafe isn't much faster than safe.  I suspect the CPU's branch
prediction is just working really well there?

Yes, it looks like it is. But this is a bit of a contrived case because
you're timing code that is 100% in the L1 cache, which might not
be a good model. How bad does it get if the loop is unrolled enough
times to fall out of the L1 cache but still stay in the L2 cache (which might
be as pessimistic a simulation as the original code is optimistic)?

I updated the test to pick a random point in a big array and then encode
16 items inline/unrolled in the hopes that this would avoid just hammering
L1.  It does mean that the 16 contiguous items will probably all get
fetched together (or have the fetch pipelined), but I think that's
probably pretty realistic.
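
Something like this, schematically (not the actual test code; 'a' is the
appender under test and the array size is arbitrary):

 std::vector<uint64_t> src(16 * 1024 * 1024);   // big enough to blow out L1
 size_t off = rand() % (src.size() - 16);       // random starting point
 for (int j = 0; j < 16; ++j)                   // unrolled in the real test
   a.append((char*)&src[off + j], sizeof(uint64_t));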

The test also went through several other revisions and refinements
yesterday with Igor's review.  Here's the result now:

[ RUN      ] BufferList.appender_bench
appending 268435456 uint64_t's
buffer::list::append 7.142146
buffer::list encode 7.968760
buffer::list::safe_appender::append 3.506399
buffer::list::safe_appender::append (reserve 16) 2.291478
buffer::list::safe_appender::append_v 3.673429
buffer::list::safe_appender::append_v (reserve 16) 2.291800
buffer::list::safe_appender encode 3.673988
buffer::list::unsafe_appender::append 1.766341
buffer::list::unsafe_appender::append_v 1.748361
buffer::list::unsafe_appender encode 1.746126
[       OK ] BufferList.appender_bench (36340 ms)

which looks much more like what I would expect to see.  The 'reserve 16'
cases are doing a single bounds check and then doing 16 unsafe appends.
Bottom line is that doing the limited bounds checks gets you most of the
way to totally unguarded, but not all the way.  And no guards is
definitely faster (roughly 2x faster than a guard for every word).
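
For illustration, the 'reserve 16' variants amount to something like the
following (the method names here are approximate, not necessarily the exact
API in the PR):

 a.reserve(16 * sizeof(uint64_t));   // one bounds check / reallocation up front
 for (int j = 0; j < 16; ++j)        // then 16 appends with no per-word guard
   a.append_unchecked((char*)&src[off + j], sizeof(uint64_t));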

Agreed, this seems much more realistic than the earlier results. It's interesting how much the bounds check in the reserve 16 case hurts, but in any event we are far better off than the default append case! Really good stuff here!


Anyway, I think the next step is to look at the enc_dec work and see if
this can complement the approach...

sage


