Re: bluestore onode diet and encoding overhead

Samuel Just <sjust@xxxxxxxxxx> · Wed, 13 Jul 2016 07:47:50 -0700



I think that seems like a good way to go!
-Sam

On Tue, Jul 12, 2016 at 6:50 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
> On Tue, 12 Jul 2016, Allen Samuels wrote:
>> Good analysis.
>>
>> My original comments about putting the oNode on a diet included the idea
>> of a "custom" encode/decode path for certain high-usage cases. At the
>> time, Sage resisted going down that path hoping that a more optimized
>> generic case would get the job done. Your analysis shows that while
>> we've achieved significant space reduction this has come at the expense
>> of CPU time -- which dominates small object performance (I suspect that
>> eventually we'd discover that the variable length decode path would be
>> responsible for a substantial read performance degradation also -- which
>> may or may not be part of the read performance drop-off that you're
>> seeing). This isn't a surprising result, though it is unfortunate.
>>
>> I believe we need to revisit the idea of custom encode/decode paths for
>> high-usage cases, only now the gains need to be focused on CPU
>> utilization as well as space efficiency.
>
> I still think we can get most or all of the way there in a generic way by
> revising the way that we interact with bufferlist for encode and decode.
> We haven't actually tried to optimize this yet, and the current code is
> pretty horribly inefficient (asserts all over the place, and many layers
> of pointer indirection to do a simple append).  I think we need to do two
> things:
>
> 1) decode path: optimize the iterator class so that it has a const char
> *current and const char *current_end that point into the current
> buffer::ptr.  This way any decode will have a single pointer
> add+comparison to ensure there is enough data to copy before falling into
> the slow path (partial buffer, move to next buffer, etc.).
>
> 2) Having that comparison is still not ideal, but we shoudl consider ways
> to get around that too.  For example, if we know that we are going to
> decode N M-byte things, we could do an iterator 'reserve' or 'check' that
> ensures we have a valid pointer for that much and then proceed without
> checks.  The interface here would be tricky, though, since in the slow
> case we'll span buffers and need to magically fall back to a different
> decode path (hard to maintain) or do a temporary copy (probably faster but
> we need to ensure the iterator owns it and frees is later).  I'd say this
> is step 2 and optional; step 1 will have the most benefit.
>
> 3) encode path: currently all encode methods take a bufferlist& and the
> bufferlist itself as an append buffer.  I think this is flawed and
> limiting.  Instead, we should make a new class called
> buffer::list::appender (or similar) and templatize the encode methods so
> they can take a safe_appender (which does bounds checking) or an
> unsafe_appender (which does not).  For the latter, the user takes
> responsibility for making sure there is enough space by doing a reserve()
> type call which returns an unsafe_appender, and it's their job to make
> sure they don't shove too much data into it.  That should make the encode
> path a memcpy + ptr increment (for savvy/optimized callers).
>
> I suggest we use bluestore as a test case to make the interfaces work and
> be fast.  If we succeed we can take advantage of it across the reset of
> the code base as well.
>
> That's my thinking, at least.  I haven't had time to prototype it out yet,
> but I think our goal should be to make the encode/decode paths capable of
> being a memcpy + ptr addition in the fast path, and let that guide the
> interface...
>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html