> -----Original Message-----
> From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> Sent: Saturday, August 13, 2016 2:32 PM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> Cc: ceph-devel@xxxxxxxxxxxxxxx
> Subject: RE: bufferlist appenders
>
> On Fri, 12 Aug 2016, Allen Samuels wrote:
> > > -----Original Message-----
> > > From: ceph-devel-owner@xxxxxxxxxxxxxxx
> > > [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Sage Weil
> > > Sent: Friday, August 12, 2016 7:27 AM
> > > To: ceph-devel@xxxxxxxxxxxxxxx
> > > Subject: bufferlist appenders
> > >
> > > A ton of time in the encoding/marshalling is spent doing bufferlist
> > > appends.  This is partly because the buffer code is doing lots of
> > > sanity range checks, and partly because there are multiple layers
> > > that get range checks and length updates (the bufferlist _len
> > > changes, and bufferlist::append_buffer (a ptr) gets its length
> > > updated, at the very least).
> > >
> > > To simplify and speed this up, I propose an 'appender' concept/type
> > > that is used for doing appends in a more efficient way.  It would be
> > > used like so:
> > >
> > >   bufferlist bl;
> > >   {
> > >     bufferlist::safe_appender a = bl.get_safe_appender();
> > >     ::encode(foo, a);
> > >   }
> > >
> > > or
> > >
> > >   {
> > >     bufferlist::unsafe_appender a = bl.get_unsafe_appender(1024);
> > >     ::encode(foo, a);
> > >   }
> > >
> > > The appender keeps its own bufferptr that it copies data into.  The
> > > bufferptr isn't given to the bufferlist until the appender is
> > > destroyed (or flush() is called explicitly).  This means that appends
> > > are generally just a memcpy and a position pointer addition.  In the
> > > safe_appender case, we also do a range check and allocate a new
> > > buffer if necessary.  In the unsafe_appender case, it is the caller's
> > > responsibility to say how big a buffer is preallocated.
> > >
> > > I have a simple prototype here:
> > >
> > >   https://github.com/ceph/ceph/pull/10700
> > >
> > > It appears to be almost 10x faster when encoding a uint64_t in a loop!
> > >
> > >   [ RUN      ] BufferList.appender_bench
> > >   appending 1073741824 bytes
> > >   buffer::list::append 20.285963
> > >   buffer::list encode 19.719120
> > >   buffer::list::safe_appender::append 2.588926
> > >   buffer::list::safe_appender::append_v 2.837026
> > >   buffer::list::safe_appender encode 3.000614
> > >   buffer::list::unsafe_appender::append 2.452116
> > >   buffer::list::unsafe_appender::append_v 2.553745
> > >   buffer::list::unsafe_appender encode 2.200110
> > >   [       OK ] BufferList.appender_bench (55637 ms)
> > >
> > > Interesting, unsafe isn't much faster than safe.  I suspect the CPU's
> > > branch prediction is just working really well there?
> >
> > Yes, it looks like it is.  But this is a bit of a contrived case
> > because you're timing code which is 100% in the L1 cache, which might
> > not be a good model.  How bad does it get if the loop is unrolled
> > enough times to fall out of the L1 cache but still stay in the L2
> > cache (which might be as pessimistic a simulation as the original code
> > is optimistic)?
>
> I updated the test to pick a random point in a big array and then encode
> 16 items inline/unrolled in the hopes that this would avoid just hammering
> L1.  It does mean that the 16 contiguous items will probably all get
> fetched together (or have the fetch pipelined), but I think that's
> probably pretty realistic.
>
> The test also went through several other revisions and refinements
> yesterday with Igor's review.
> Here's the result now:
>
>   [ RUN      ] BufferList.appender_bench
>   appending 268435456 uint64_t's
>   buffer::list::append 7.142146
>   buffer::list encode 7.968760
>   buffer::list::safe_appender::append 3.506399
>   buffer::list::safe_appender::append (reserve 16) 2.291478
>   buffer::list::safe_appender::append_v 3.673429
>   buffer::list::safe_appender::append_v (reserve 16) 2.291800
>   buffer::list::safe_appender encode 3.673988
>   buffer::list::unsafe_appender::append 1.766341
>   buffer::list::unsafe_appender::append_v 1.748361
>   buffer::list::unsafe_appender encode 1.746126
>   [       OK ] BufferList.appender_bench (36340 ms)
>
> which looks much more like what I would expect to see.  The 'reserve 16'
> cases are doing a single bounds check and then doing 16 unsafe appends.
> Bottom line is that doing the limited bounds checks gets you most of the
> way to totally unguarded, but not all the way.  And no guards is
> definitely faster (roughly 2x faster than a guard for every word).
>
> Anyway, I think the next step is to look at the enc_dec and see if this
> can complement the approach...
>
> sage

I think this is still too optimistic.  My comments about the L1 cache were
intended to be oriented toward the code cache, not the data cache (sorry, I
should have been more specific -- communication IS difficult).  Whatever the
data-cache access pattern is, it'll be more or less the same for all of the
methods, so I think we can ignore it [I agree that having the data in the L2
cache is the right thing to simulate].

However, for the code cache, the safe_appender case carries the code for
"if (pos + l > end) { flush(); prepare_buffer(l); }; memcpy(); iter += ..."
at every append site, vs. the enc_dec/unsafe_appender case which is just
"memcpy(); iter += ...".  In real-world code, where this is only one of many
branches competing for predictor and instruction-cache resources, I suspect
the branch-prediction behavior that makes the two cases look similar in the
microbenchmark is significantly degraded.  It's more likely that the relative
performance of the two schemes is a function of the code density (where
enc_dec/unsafe_appender wins big time).

But I agree that the thing to strive for is a hybrid of the two approaches.
It looks to me like if we married the unsafe_appender to the unified
estimate/encode paradigm of enc_dec we would have the best of both worlds.
I have to run, but I suspect that if unsafe_appender took as a parameter the
pair of estimate/encode functions, and those functions were
templated/generated like what I did in the enc_dec framework, we would
pretty much be there.
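To make the code-cache point concrete, the per-append difference boils down
to roughly the following.  This is only a sketch, not the actual code in PR
10700; the struct and member names here are invented for illustration.

  #include <cstddef>
  #include <cstring>

  // Safe path: every append carries the guard, and the cold-path call is
  // dragged into the instruction stream at each call site.
  struct safe_appender_sketch {
    char *pos = nullptr;
    char *end = nullptr;
    void append(const char *p, std::size_t l) {
      if (pos + l > end)
        refill(l);               // flush the current ptr, allocate a new one
      std::memcpy(pos, p, l);
      pos += l;
    }
    void refill(std::size_t l);  // cold path, elided in this sketch
  };

  // Unsafe path: the caller guaranteed the space up front, so the hot path
  // is just the copy and the pointer bump.
  struct unsafe_appender_sketch {
    char *pos = nullptr;
    void append(const char *p, std::size_t l) {
      std::memcpy(pos, p, l);
      pos += l;
    }
  };

The per-call difference looks tiny, but multiplied across every field of
every encode it is the difference between straight-line code and code that
keeps touching a branch plus a chunk of slow-path logic.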
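And here is roughly what I mean by marrying the unsafe_appender to an
estimate/encode pair.  Again only a sketch: foo_t, estimate(), and
encode_with_estimate() are invented for illustration (the only name taken
from Sage's prototype is get_unsafe_appender()), and in the enc_dec
framework the estimate and the encode would come from the same templated
declaration rather than being hand-written.

  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  // Hypothetical encodable type: a cheap worst-case size estimate paired
  // with an encode() that assumes the space has already been reserved.
  struct foo_t {
    uint64_t a = 0;
    uint64_t b = 0;

    std::size_t estimate() const { return sizeof(a) + sizeof(b); }

    template <typename Appender>
    void encode(Appender &app) const {
      app.append(reinterpret_cast<const char*>(&a), sizeof(a));
      app.append(reinterpret_cast<const char*>(&b), sizeof(b));
    }
  };

  // Hypothetical driver: one reservation up front, sized by the estimate,
  // then the whole encode runs on the unguarded path.
  template <typename BufferList, typename T>
  void encode_with_estimate(BufferList &bl, const T &v) {
    auto app = bl.get_unsafe_appender(v.estimate()); // single bounds decision
    v.encode(app);  // every append inside is a bare memcpy + pointer bump
  }                 // the appender hands its buffer back to bl when destroyed

The point is that the only guard left is the one implied by the reservation,
and if the estimate is generated from the same declaration as the encode it
can't drift out of sync with it.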