RE: bluestore onode diet and encoding overhead

Allen Samuels <Allen.Samuels@xxxxxxxxxxx> · Thu, 14 Jul 2016 16:34:00 +0000

Ideally, yes. Either it happens tomorrow or two weeks from now. 

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> Sent: Thursday, July 14, 2016 9:32 AM
> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil
> <sweil@xxxxxxxxxx>
> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: bluestore onode diet and encoding overhead
> 
> So right now I'm knee deep in bisecting bluestore to track down our read
> regression, but if you want to throw together a PR that uses this for encode
> in bluestore I'd be certainly happy to give it a whirl on our test cluster.
> 
> Mark
> 
> On 07/14/2016 11:20 AM, Allen Samuels wrote:
> > BTW, I see this stuff as gradually replacing the existing encode/decode
> infrastructure. It's pretty easy to have them side-by-side as well as have the
> new infrastructure be wire-compatible with the current stuff. That'll allow a
> slow conversion from the old-style to the new-style.  The only that's really
> different between the two (on the wire) is that I proposed the new stuff to
> have a length prefix so that the decoder knows how much data to
> "straighten" before launching the fast decode (this is the equivalent of the
> ESTIMATE phase during encode). For old-style stuff that doesn't have the
> prefix, you'll have to "straighten" the entire remainder of the buffer -- this
> may limit the rate of conversion (in that you can only afford to covert code to
> the new style when you know that the overhead of straightening is
> affordable -- probably because you know that there's not much data present
> OR you assume that you're in a temporary transient environment.
> >
> >
> > Allen Samuels
> > SanDisk |a Western Digital brand
> > 2880 Junction Avenue, San Jose, CA 95134
> > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx
> >
> >> -----Original Message-----
> >> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> >> Sent: Thursday, July 14, 2016 4:16 AM
> >> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil
> >> <sweil@xxxxxxxxxx>
> >> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> >> Subject: Re: bluestore onode diet and encoding overhead
> >>
> >> On 07/14/2016 12:52 AM, Allen Samuels wrote:
> >>> As promised, here's some code that hacks out a new encode/decode
> >> framework. That has the advantage of only having to list the fields
> >> of a struct once and is pretty much guaranteed to never overrun a
> buffer....
> >>>
> >>> Comments are requested :)
> >>
> >> It compiles! :D
> >>
> >> I looked over the code, but I want to look it over again after I've
> >> had my coffee since I'm still shaking the cobwebs out.  Would the
> >> idea here be that if you are doing varint encoding for example that
> >> you always allocate the buffer based on ESTIMATE (also taking into
> >> account the encoding overhead), but typically expect a much smaller
> encoding?
> >>
> >> As it is, it's very clever.
> >>
> >> Mark
> >>
> >>>
> >>>
> >>> #include <iostream>
> >>> #include <fstream>
> >>> #include <set>
> >>> #include <string>
> >>> #include <string.h>
> >>>
> >>>
> /*******************************************************
> >>>
> >>>
> >>>    New fast encode/decode framework.
> >>>
> >>>    The entire framework is built around the idea that each object
> >>> has three
> >> operations:
> >>>
> >>>      ESTIMATE  -- worst-case estimate of the amount of storage
> >>> required for
> >> this object
> >>>      ENCODE    -- encode object into buffer of size ESTIMATE
> >>>      DECODE    -- encode object from buffer of size actual.
> >>>
> >>>    Each object has a single templated function that actually
> >>> provides all three
> >> operations in a single set of code.
> >>>    But doing this, it's pretty much guaranteed that the ESTIMATE and
> >>> the
> >> ENCODE code are in harmony (i.e. that the estimate is correct)
> >>>    it also saves a lot of typing/reading...
> >>>
> >>>    Generally, all three operations are provided on a single function
> >>> name
> >> with the input and return parameters overloaded to distinguish them.
> >>>
> >>>    It's observed that for each of the three operations there is a
> >>> single value
> >> which needs to be transmitted between each of the micro-
> encode/decode
> >> calls
> >>>    Yes, this is confusing, but let's look at a simple example
> >>>
> >>>     struct simple {
> >>>       int a;
> >>>       float b;
> >>>       string c;
> >>>       set<int> d;
> >>>     };
> >>>
> >>>     To encode this struct we generate a function that does the
> >>> micro-
> >> encoding of each of the fields of the struct
> >>>     Here's an example of a function that does the ESTIMATE operation.
> >>>
> >>>     size_t simple::estimate() {
> >>>        return
> >>>           sizeof(a) +
> >>>           sizeof(b) +
> >>>           c.size() +
> >>>           d.size() * sizeof(int);
> >>>     }
> >>>
> >>>     We're going to re-write it as:
> >>>
> >>>     size_t simple::estimate(size_t p) {
> >>>        p = estimate(p,a);
> >>>        p = estimate(p,b);
> >>>        p = estimate(p,c);
> >>>        p = estimate(p,d);
> >>>        return p;
> >>>     }
> >>>
> >>>     assuming that the sorta function:
> >>>
> >>>     template<typename t> size_t estimate(size_t p,t& o) { return p +
> >> sizeof(o); }
> >>>     template<typename t> size_t estimate(size_t p,set<t>& o) {
> >>> return p + o.size() * sizeof(t); }
> >>>
> >>>
> >>>     similarly, the encode operation is represented as:
> >>>
> >>>     char * simple::encode(char *p) {
> >>>        p = encode(p,a);
> >>>        p = encode(p,b);
> >>>        p = encode(p,c);
> >>>        p = encode(p,d);
> >>>        return p;
> >>>     }
> >>>
> >>>     similarly, the decode operation is represented as:
> >>>
> >>>     const char * simple::decode(const char *p) {
> >>>        p = decode(p,a);
> >>>        p = decode(p,b);
> >>>        p = decode(p,c);
> >>>        p = decode(p,d);
> >>>        return p;
> >>>     }
> >>>
> >>>
> >>> You can now see that it's possible to create a single function that
> >>> does all three operations in a single block of code, provided that
> >>> you can
> >> fiddle the input/output parameter types appropriately.
> >>>
> >>> In essence the pattern is
> >>>
> >>>     p = enc_dec(p,struct_field_1);
> >>>     p = enc_dec(p,struct_field_2);
> >>>     p = enc_dec(p,struct_field_3);
> >>>
> >>> With the type of p being set differently for each operation, i.e.,
> >>>     for ESTIMATE, p = size_t
> >>>     for ENCODE,   p = char *
> >>>     for DECODE,   p = const char *
> >>>
> >>> This is the essence of how the encode/decode framework operates.
> >> Though there is some more sophistication...
> >>>
> >>> ----------------------
> >>>
> >>> We also want to allow the encode/decode machinery to be per-type and
> >>> to operate
> >>>
> >>>
> >>
> **********************************************************
> >> ************
> >>> *******/
> >>>
> >>> using namespace std;
> >>>
> >>> //
> >>> // Just like the existing encode/decode machinery. The environment
> >>> provides a rich set of // pre-defined encodes for primitive types
> >>> and containers //
> >>>
> >>> #define DEFINE_ENC_DEC_RAW(type) \
> >>> inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
> >>> inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p +
> >> sizeof(type); } \
> >>> inline const char *enc_dec(const char *p,type &o) { o = *(const type
> >>> *)p; return p + sizeof(type); }
> >>>
> >>> DEFINE_ENC_DEC_RAW(int);
> >>> DEFINE_ENC_DEC_RAW(size_t);
> >>>
> >>> //
> >>> // String encode/decode (Yea, I know size_t isn't portable -- this
> >>> is an EXAMPLE man...) // inline size_t enc_dec(size_t p,string& s) {
> >>> return p + sizeof(size_t) + s.size(); } inline char * enc_dec(char *
> >>> p,string& s) { *(size_t *)p = s.size();
> >>> memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p +
> >>> sizeof(size_t)
> >>> + s.size(); } inline const char *enc_dec(const char *p,string& s) {
> >>> + s
> >>> = string(p + sizeof(size_t),*(size_t *)p); return p + sizeof(size_t)
> >>> + s.size(); }
> >>>
> >>> //
> >>> // Let's do a container.
> >>> //
> >>> // One of the problems with a container is that making an accurate
> >>> estimate of the size // would theoretically require that you walk
> >>> the entire
> >> container and add up the sizes of each element.
> >>> // We probably don't want to do that. So here, I do a hack that just
> >>> assumes that I can fake up a individual element // and multiple that
> >>> by the number of elements in a container. This hack works anytime
> >>> that the estimate function // for the contained type has a fixed
> maximum size.
> >> BTW, this is safe, if the contained type has a variable size //
> >> (like set<string>) then it will fault out the first time you run it.
> >>> //
> >>> // Naturally, something like set<string> or map<string,string> is a
> >>> highly desirable thing to be able to encode/decode // there's no
> >>> reason
> >> that you can't create a enc_dec_slow function that properly computes
> >> the maximum size by walking the container.
> >>> //
> >>> template<typename t>
> >>> inline size_t enc_dec(size_t p,set<t>& s) { return p +
> >>> sizeof(size_t)
> >>> + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }
> >>>
> >>> template<typename t>
> >>> inline char *enc_dec(char *p,set<t>& s) {
> >>>    size_t sz = s.size();
> >>>    p = enc_dec(p,sz);
> >>>    for (const t& e : s) {
> >>>       p = enc_dec(p,const_cast<t&>(e));
> >>>    }
> >>>    return p;
> >>> }
> >>>
> >>> template<typename t>
> >>> inline const char *enc_dec(const char *p,set<t>&s) {
> >>>    size_t sz;
> >>>    p = enc_dec(p,sz);
> >>>    while (sz--) {
> >>>       t temp;
> >>>       p = enc_dec(p,temp);
> >>>       s.insert(temp);
> >>>    }
> >>>    return p;
> >>> }
> >>>
> >>> //
> >>> // Specialized encode/decode for a single data type. These are
> >>> invoked
> >> explicitly...
> >>> //
> >>> inline size_t enc_dec_lba(size_t p,int& lba) {
> >>>    return p + sizeof(lba); // Max....
> >>> }
> >>>
> >>> inline char * enc_dec_lba(char *p,int& lba) {
> >>>    *p = 15;
> >>>    return p + 1; // blah blah
> >>> }
> >>>
> >>> inline const char *enc_dec_lba(const char *p,int& lba) {
> >>>    lba = *p;
> >>>    return p+1;
> >>> }
> >>>
> >>> //
> >>> // Specialized encode/decode for more sophisticated things primitives.
> >>> //
> >>> // Here's an example of a encode/decoder for a pair of fields //
> >>> inline size_t enc_dec_range(size_t p,short& start,short& end) {
> >>>    return p + 2 * sizeof(short);
> >>> }
> >>>
> >>> inline char *enc_dec_range(char *p, short& start, short& end) {
> >>>    short *s = (short *) p;
> >>>    s[0] = start;
> >>>    s[1] = end;
> >>>    return p + sizeof(short) * 2;
> >>> }
> >>>
> >>> inline const char *enc_dec_range(const char *p,short& start, short&
> end) {
> >>>    start = *(short *)p;
> >>>    end   = *(short *)(p + sizeof(short));
> >>>    return p + 2*sizeof(short);
> >>> }
> >>>
> >>>
> >>> //
> >>> // Some C++ template wizardry to make the single encode/decode
> >> function possible.
> >>> //
> >>> enum SERIAL_TYPE {
> >>>    ESTIMATE,
> >>>    ENCODE,
> >>>    DECODE
> >>> };
> >>>
> >>> template <enum SERIAL_TYPE s> struct serial_type;
> >>>
> >>> template<> struct serial_type<ESTIMATE> { typedef size_t type; };
> >>> template<> struct serial_type<ENCODE>   { typedef char * type; };
> >>> template<> struct serial_type<DECODE>   { typedef const char *type; };
> >>>
> >>> //
> >>> // This macro is the key, it connects the external non-member
> >>> function to
> >> the correct member function.
> >>> //
> >>> #define DEFINE_STRUCT_ENC_DEC(s) \
> >>> inline size_t      enc_dec(size_t p, s &o) { return
> o.enc_dec<ESTIMATE>(p); }
> >> \
> >>> inline char *      enc_dec(char *p , s &o)  { return
> o.enc_dec<ENCODE>(p); }
> >> \
> >>> inline const char *enc_dec(const char *p,s &o)  { return
> >>> o.enc_dec<DECODE>(p); }
> >>>
> >>> //
> >>> // Our example structure
> >>> //
> >>> struct astruct {
> >>>    int a;
> >>>    set<int> b;
> >>>    int lba;
> >>>    short start,end;
> >>>
> >>>    //
> >>>    // <<<<< You need to provide this function just one.
> >>>    //
> >>>    template<enum SERIAL_TYPE s> typename serial_type<s>::type
> >> enc_dec(typename serial_type<s>::type p) {
> >>>       p = ::enc_dec(p,a);
> >>>       p = ::enc_dec(p,b);
> >>>       p = ::enc_dec_lba(p,lba);
> >>>       p = ::enc_dec_range(p,start,end);
> >>>       return p;
> >>>    }
> >>> };
> >>>
> >>> //
> >>> // This macro connects the global enc_dec to the member function.
> >>> // One of these per struct declaration //
> >>> DEFINE_STRUCT_ENC_DEC(astruct);
> >>>
> >>>
> >>> //
> >>> // Here's a simple test program. The real encode/decode framework
> >>> needs to be connected to bufferlist using the pseudo-code // that I
> >> documented in my previous email.
> >>> //
> >>>
> >>> int main(int argc,char **argv) {
> >>>
> >>>    astruct a;
> >>>    a.a = 10;
> >>>    a.b.insert(2);
> >>>    a.b.insert(3);
> >>>    a.lba = 12;
> >>>
> >>>    size_t s = a.enc_dec<ESTIMATE>(size_t(0));
> >>>    cout << "Estimated size is " << s << "\n";
> >>>
> >>>    char buffer[100];
> >>>
> >>>    char *end = a.enc_dec<ENCODE>(buffer);
> >>>
> >>>    cout << "Actual storage was " << end-buffer << "\n";
> >>>
> >>>    astruct b;
> >>>
> >>>    (void) b.enc_dec<DECODE>(buffer); // decode it
> >>>
> >>>    cout << "A.a = " << b.a << "\n";
> >>>    for (auto e : b.b) {
> >>>       cout << " " << e;
> >>>    }
> >>>
> >>>    cout << "\n";
> >>>
> >>>    cout << "a.lba = " << b.lba << "\n";
> >>>
> >>>    return 0;
> >>> }
> >>>
> >>>
> >>> Allen Samuels
> >>> SanDisk |a Western Digital brand
> >>> 2880 Junction Avenue, San Jose, CA 95134
> >>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> >>>> Sent: Tuesday, July 12, 2016 8:13 PM
> >>>> To: Sage Weil <sweil@xxxxxxxxxx>; Allen Samuels
> >>>> <Allen.Samuels@xxxxxxxxxxx>
> >>>> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> >>>> Subject: Re: bluestore onode diet and encoding overhead
> >>>>
> >>>>
> >>>>
> >>>> On 07/12/2016 08:50 PM, Sage Weil wrote:
> >>>>> On Tue, 12 Jul 2016, Allen Samuels wrote:
> >>>>>> Good analysis.
> >>>>>>
> >>>>>> My original comments about putting the oNode on a diet included
> >>>>>> the idea of a "custom" encode/decode path for certain high-usage
> cases.
> >>>>>> At the time, Sage resisted going down that path hoping that a
> >>>>>> more optimized generic case would get the job done. Your analysis
> >>>>>> shows that while we've achieved significant space reduction this
> >>>>>> has come at the expense of CPU time -- which dominates small
> >>>>>> object performance (I suspect that eventually we'd discover that
> >>>>>> the variable length decode path would be responsible for a
> >>>>>> substantial read performance degradation also -- which may or may
> >>>>>> not be part of the read performance drop-off that you're seeing).
> >>>>>> This isn't a
> >> surprising
> >>>> result, though it is unfortunate.
> >>>>>>
> >>>>>> I believe we need to revisit the idea of custom encode/decode
> >>>>>> paths for high-usage cases, only now the gains need to be focused
> >>>>>> on CPU utilization as well as space efficiency.
> >>>>>
> >>>>> I still think we can get most or all of the way there in a generic
> >>>>> way by revising the way that we interact with bufferlist for
> >>>>> encode and
> >> decode.
> >>>>> We haven't actually tried to optimize this yet, and the current
> >>>>> code is pretty horribly inefficient (asserts all over the place,
> >>>>> and many layers of pointer indirection to do a simple append).  I
> >>>>> think we need to do two
> >>>>> things:
> >>>>>
> >>>>> 1) decode path: optimize the iterator class so that it has a const
> >>>>> char *current and const char *current_end that point into the
> >>>>> current buffer::ptr.  This way any decode will have a single
> >>>>> pointer
> >>>>> add+comparison to ensure there is enough data to copy before
> >>>>> add+falling into
> >>>>> the slow path (partial buffer, move to next buffer, etc.).
> >>>>>
> >>>>
> >>>> I don't have a good sense yet for how much this is hurting us in
> >>>> the read path.  We screwed something up in the last couple of weeks
> >>>> and small
> >> reads
> >>>> are quite slow.
> >>>>
> >>>>> 2) Having that comparison is still not ideal, but we shoudl
> >>>>> consider ways to get around that too.  For example, if we know
> >>>>> that we are going to decode N M-byte things, we could do an
> >>>>> iterator 'reserve' or 'check' that ensures we have a valid pointer
> >>>>> for that much and then proceed without checks.  The interface here
> >>>>> would be tricky, though, since in the slow case we'll span buffers
> >>>>> and need to magically fall back to a different decode path (hard
> >>>>> to maintain) or do a temporary copy (probably faster but we need
> >>>>> to ensure the iterator owns it and frees is later).  I'd say this
> >>>>> is step 2 and optional; step 1 will have the most
> >>>> benefit.
> >>>>>
> >>>>> 3) encode path: currently all encode methods take a bufferlist&
> >>>>> and the bufferlist itself as an append buffer.  I think this is
> >>>>> flawed and limiting.  Instead, we should make a new class called
> >>>>> buffer::list::appender (or similar) and templatize the encode
> >>>>> methods so they can take a safe_appender (which does bounds
> >>>>> checking) or an unsafe_appender (which does not).  For the latter,
> >>>>> the user takes responsibility for making sure there is enough
> >>>>> space by doing a
> >>>>> reserve() type call which returns an unsafe_appender, and it's
> >>>>> their job to make sure they don't shove too much data into it.
> >>>>> That should make the encode path a memcpy + ptr increment (for
> >>>>> savvy/optimized
> >>>> callers).
> >>>>
> >>>> Seems reasonable and similar in performance to what Piotr and I
> >>>> were discussing this morning.  As a very simple test I was thinking
> >>>> of doing a
> >> quick
> >>>> size computation and then passing that in to increase the
> >>>> append_buffer
> >> size
> >>>> when the bufferlist is created in Bluestore::_txc_write_nodes.  His
> >>>> idea
> >> went
> >>>> a bit farther to break the encapsulation, compute the fully encoded
> >>>> message, and dump it directly into a buffer of a computed size
> >>>> without
> >> the
> >>>> extra assert checks or bounds checking.  Obviously his idea would
> >>>> be
> >> faster
> >>>> but more work.
> >>>>
> >>>> It sounds like your solution would be similar but a bit more formalized.
> >>>>
> >>>>>
> >>>>> I suggest we use bluestore as a test case to make the interfaces
> >>>>> work and be fast.  If we succeed we can take advantage of it
> >>>>> across the reset of the code base as well.
> >>>>
> >>>> Do we have other places in the code with similar byte append
> behavior?
> >>>> That's what's really killing us I think, especially with how small
> >>>> the new append_buffer is when you run out of space when appending
> bytes.
> >>>>
> >>>>>
> >>>>> That's my thinking, at least.  I haven't had time to prototype it
> >>>>> out yet, but I think our goal should be to make the encode/decode
> >>>>> paths capable of being a memcpy + ptr addition in the fast path,
> >>>>> and let that guide the interface...
> >>>>>
> >>>>> sage
> >>>>> --
> >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More
> >>>> majordomo
> >>>>> info at  http://vger.kernel.org/majordomo-info.html
> >>>>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html