RE: bluestore onode diet and encoding overhead

On Fri, 12 Aug 2016, Allen Samuels wrote:
> > -----Original Message-----
> > From: Sage Weil [mailto:sweil@xxxxxxxxxx]
> > Sent: Friday, August 12, 2016 9:18 AM
> > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>
> > Cc: Mark Nelson <mnelson@xxxxxxxxxx>; ceph-devel <ceph-
> > devel@xxxxxxxxxxxxxxx>
> > Subject: RE: bluestore onode diet and encoding overhead
> > 
> > Okay, I finally had some time to read through this thread!
> > 
> > On Thu, 14 Jul 2016, Allen Samuels wrote:
> > > Yes, I did actually run the code before I posted it.
> > >
> > > w.r.t. varint encoding. You have two choices w.r.t. a variable length
> > > encoded, you could examine the data to accurately predict the output
> > > size OR you could just return a constant that represents the
> > > worst-case
> > > (max) size.  For individual fields, it probably doesn't matter what
> > > you chose, but for fields that are part of something in a container,
> > > you probably want the option of NOT running down the container to size
> > > up each element -- so you'd just choose the worst-case size for the
> > > estimator.
> > >
> > > Though this code doesn't show it, I wrote some pseudo-code in a
> > > previous e-mail that glues this framework into the bufferlist stuff.
> > > That pseudo code is well prepared for estimate functions that are too
> > > large (indeed, it expects that to happen) and it naturally handles
> > > buffer overrun detection.
> > >
> > > I didn't describe it in the example, but this framework very naturally
> > > handles versioning, you just add some code like:
> > >
> > > struct abc {
> > >    int version;
> > >    int a;
> > >    int b;
> > >    ..... enc_dec(p) {
> > >       ::enc_dec(p, version);
> > >       ::enc_dec(p, a);
> > >       if (s != DECODE || version > 5) ::enc_dec(p, b); // This field is present in
> > >       // all estimate and encode operations, but only in decode operations
> > >       // when version is > 5
> > >    }
> > > };
> > 
> > This is pretty cool.  I'm mainly nervous about the versioning stuff.  The
> > current encode/decode scheme already has a length (so we're good there),
> > and also two other fields: struct_v and compat_v, indicating the version of
> > the encoding and the oldest decoder that can understand it.  I don't see any
> > reason why that couldn't be replicated here.  I am a bit nervous about the
> > decoding side, though, since it can get complicated.  What you have above is
> > the common case, but even a moderately simple one (where we didn't have
> > to do anything kludgey) looks like this:
> > 
> > 	https://github.com/ceph/ceph/blob/master/src/osd/osd_types.cc#L2337
> 
> Yes, the struct_v and compat_v stuff can be easily handled and I don't think the code is any more "complicated" than what's done for the object that you showed. In fact the code is almost identical....
> 
> 
> > 
> > I suspect those conditionals would end up looking like
> > 
> >  if (s != DECODE || version > 5) {
> >    ::enc_dec(p, b);
> >  } else if (s == DECODE) {
> >    b = default/compat value;
> >  }
> 
> Almost, I think it's simpler. Let's assume that the case we care about is a field that's present only in versions > 5. Then the code is:
> 
>    if (version > 5) {
>       ::enc_dec(p,b);
>    } else {
>       b = 1;  // compat value
>    }

If we set the version/struct_v var in the encode and estimate paths to 
the latest version, then yeah.  It may read a bit strange because the 
compat block will only be visited in the decode path (when struct_v 
could be small), but that should be fine.

> 
> Which is pretty much what it looks like in the current code...
> 
> 
> > 
> > I'm still trying to sort out in my head how this relates to the appender thing.  I
> > think they're largely orthogonal, but the estimate function here could be
> > used to drive the unsafe_appender stuff pretty seamlessly.
> > Using the unsafe_appender manually is going to be a lot more error-prone,
> > but should get the same performance benefit, without unifying the
> > encode/decode stuff.
> 
> Yes, they're conceptually orthogonal. In all cases, we make a 
> CONSERVATIVE estimate of the space that's required, acquire that space, 
> and then encode without checking for overflow. At the end of the encode 
> you can give back the unused space.
> 
> > 
> > I'm a bit worried that the estimate process will be too slow, though.  On a
> > complicated nested object, for example, it will have to traverse the full data
> > structure once to estimate, and then again to encode.  It might be simpler
> > and faster to have the outer parts of the structure operate on a
> > safe_encoder, and construct an unsafe_encoder only when we are explicitly
> > prepared to do the estimate.  For example, we can have a safe_encoder
> > method like
> 
> These are exactly the right tradeoffs. The framework allows you to 
> specify, for particular containers, whether you need to walk the 
> container to make an estimate. The optimization is only important for 
> containers with large numbers of small objects -- in any other case 
> (small numbers, large objects....) the overhead isn't important to 
> optimize out.
> 
> BTW, there's nothing in the framework that requires you to do the 
> estimate process once for the whole world; you can absolutely do it 
> piecemeal like you suggest.

I'm having trouble imagining what this code is going to look like.  How 
close is your branch to a point where I can start playing with it?
 
> IMO, the most important thing about the estimate/encode cycle is that 
> the estimate NEVER be wrong. I was very worried about a design pattern 
> that separated the estimation process from the actual encoding process 
> in a way that would allow this kind of error to creep in. What I'm 
> worried about is an endless series of production run-time failures due 
> to incorrect estimation for low-probability, data-dependent cases -- in 
> other words, I consider it a requirement of the design that the 
> estimation process be so tightly linked to the encoding process that 
> the possibility of rare overruns is essentially eliminated by 
> construction. So, I ended up with a design pattern that -- by default -- 
> is guaranteed to get a "good enough" answer [i.e., conservative] at the 
> expense of time -- but you can selectively override the default in a 
> way that I think is pretty safe to get that time back when it matters.

Yeah, I agree here.

sage

> BTW, I expect most enc_dec calls for ESTIMATION to compile to just a few 
> multiplies and adds for anything but a container you have to walk.
> 
> 
> > 
> >   unsafe_encoder reserve(size_t s);
> > 
> > so that we can do
> > 
> >   void encode(bufferlist::safe_appender& ap) const {
> >     ENCODE_START(2, 2);
> >     // do a single range check for all of our simple members
> >     {
> >       unsafe_encoder t = ap.reserve(5 * sizeof(uint64_t));
> >       ::encode(foo, t);
> >       ::encode(bar, t);
> >       ::encode(baz, t);
> >       ::encode(a, t);
> >       ::encode(b, t);
> >     }
> >     // use the safe encoder for some complex ones
> >     ::encode(widget, ap);
> >     ::encode(widget2, ap);
> >     // explicitly estimate a simple container
> >     {
> >       unsafe_encoder t = ap.reserve(sub.size() * known_worst_case);
> >       ::encode(sub, t);
> >     }
> >     // dynamically range check a complex container
> >     ::encode(complex_container, ap);
> >   }
> > 
> > The enc_dec currently forces a full estimate in all cases, even when it's not
> > really needed.  Perhaps we can come up with some set of templates and
> > wrapper functions so that we can use safe and unsafe encoders somewhat
> > interchangeably so that the estimate infrastructure is only triggered when
> > needed?
> 
> I think the characterization of "full estimate in all cases, even when it's not really needed" isn't correct. If you code the above using my framework, then for all except the two "complex" containers the estimation is pretty simple, just a few adds and multiplies.
> 
> It is true that for the "complex" containers you'll walk the container twice, once for the estimate and once for the encode. However, that's the only difference: the actual cost of encoding the elements of the two containers is the same (i.e., a reserve of the size followed by the encoding of the container itself); the only delta is the overhead of the for_each loop itself -- the cost of walking the container. I believe that if the container is "complex", then the cost of walking it, compared to the cost of checking the buffer size and doing the per-element encode, is minimal. In other words, I disagree about the cost of the estimate phase (you're doing the same work, just redistributed; I only walk the container itself once extra).
> 
> 
>  
> > 
> > sage
> > 
> > 
> > 
> > > What this framework doesn't yet handle very well is situations where
> > > you have a container whose contained type is a primitive (e.g.,
> > > uint8) and you want that contained type to be custom encoded.
> > > Currently, the only solution is to replace the contained primitive
> > > type with a class wrapper. Unfortunately a typedef is NOT sufficient
> > > to differentiate it.
> > >
> > > Allen Samuels
> > > SanDisk |a Western Digital brand
> > > 2880 Junction Avenue, San Jose, CA 95134
> > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx
> > >
> > >
> > > > -----Original Message-----
> > > > From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> > > > Sent: Thursday, July 14, 2016 4:16 AM
> > > > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil
> > > > <sweil@xxxxxxxxxx>
> > > > Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> > > > Subject: Re: bluestore onode diet and encoding overhead
> > > >
> > > > On 07/14/2016 12:52 AM, Allen Samuels wrote:
> > > > > As promised, here's some code that hacks out a new encode/decode
> > > > > framework. That has the advantage of only having to list the fields
> > > > > of a struct once and is pretty much guaranteed to never overrun a
> > > > > buffer....
> > > > >
> > > > > Comments are requested :)
> > > >
> > > > It compiles! :D
> > > >
> > > > I looked over the code, but I want to look it over again after I've
> > > > had my coffee since I'm still shaking the cobwebs out.  Would the
> > > > idea here be that if you are doing varint encoding for example that
> > > > you always allocate the buffer based on ESTIMATE (also taking into
> > > > account the encoding overhead), but typically expect a much smaller
> > > > encoding?
> > > >
> > > > As it is, it's very clever.
> > > >
> > > > Mark
> > > >
> > > > >
> > > > >
> > > > > #include <iostream>
> > > > > #include <fstream>
> > > > > #include <set>
> > > > > #include <string>
> > > > > #include <string.h>
> > > > >
> > > > >
> > > > > /*******************************************************
> > > > >
> > > > >
> > > > >    New fast encode/decode framework.
> > > > >
> > > > >    The entire framework is built around the idea that each object
> > > > >    has three operations:
> > > > >
> > > > >      ESTIMATE  -- worst-case estimate of the amount of storage
> > > > >                   required for this object
> > > > >      ENCODE    -- encode object into buffer of size ESTIMATE
> > > > >      DECODE    -- decode object from buffer of actual size.
> > > > >
> > > > >    Each object has a single templated function that actually
> > > > >    provides all three operations in a single set of code.
> > > > >    By doing this, it's pretty much guaranteed that the ESTIMATE
> > > > >    and the ENCODE code are in harmony (i.e. that the estimate is
> > > > >    correct). It also saves a lot of typing/reading...
> > > > >
> > > > >    Generally, all three operations are provided on a single
> > > > >    function name with the input and return parameters overloaded
> > > > >    to distinguish them.
> > > > >
> > > > >    It's observed that for each of the three operations there is a
> > > > >    single value which needs to be transmitted between each of the
> > > > >    micro-encode/decode calls.
> > > > >    Yes, this is confusing, but let's look at a simple example:
> > > > >
> > > > >     struct simple {
> > > > >       int a;
> > > > >       float b;
> > > > >       string c;
> > > > >       set<int> d;
> > > > >     };
> > > > >
> > > > >     To encode this struct we generate a function that does the
> > > > >     micro-encoding of each of the fields of the struct.
> > > > >     Here's an example of a function that does the ESTIMATE operation.
> > > > >
> > > > >     size_t simple::estimate() {
> > > > >        return
> > > > >           sizeof(a) +
> > > > >           sizeof(b) +
> > > > >           c.size() +
> > > > >           d.size() * sizeof(int);
> > > > >     }
> > > > >
> > > > >     We're going to re-write it as:
> > > > >
> > > > >     size_t simple::estimate(size_t p) {
> > > > >        p = estimate(p,a);
> > > > >        p = estimate(p,b);
> > > > >        p = estimate(p,c);
> > > > >        p = estimate(p,d);
> > > > >        return p;
> > > > >     }
> > > > >
> > > > >     assuming sorta-functions like:
> > > > >
> > > > >     template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); }
> > > > >     template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); }
> > > > >
> > > > >
> > > > >     similarly, the encode operation is represented as:
> > > > >
> > > > >     char * simple::encode(char *p) {
> > > > >        p = encode(p,a);
> > > > >        p = encode(p,b);
> > > > >        p = encode(p,c);
> > > > >        p = encode(p,d);
> > > > >        return p;
> > > > >     }
> > > > >
> > > > >     similarly, the decode operation is represented as:
> > > > >
> > > > >     const char * simple::decode(const char *p) {
> > > > >        p = decode(p,a);
> > > > >        p = decode(p,b);
> > > > >        p = decode(p,c);
> > > > >        p = decode(p,d);
> > > > >        return p;
> > > > >     }
> > > > >
> > > > >
> > > > > You can now see that it's possible to create a single function
> > > > > that does all three operations in a single block of code, provided
> > > > > that you can fiddle the input/output parameter types appropriately.
> > > > >
> > > > > In essence the pattern is
> > > > >
> > > > >     p = enc_dec(p,struct_field_1);
> > > > >     p = enc_dec(p,struct_field_2);
> > > > >     p = enc_dec(p,struct_field_3);
> > > > >
> > > > > With the type of p being set differently for each operation, i.e.,
> > > > >     for ESTIMATE, p = size_t
> > > > >     for ENCODE,   p = char *
> > > > >     for DECODE,   p = const char *
> > > > >
> > > > > This is the essence of how the encode/decode framework operates.
> > > > > Though there is some more sophistication...
> > > > >
> > > > > ----------------------
> > > > >
> > > > > We also want to allow the encode/decode machinery to be per-type
> > > > > and to operate
> > > > >
> > > > >
> > > >
> > > > > *********************************************************************/
> > > > >
> > > > > using namespace std;
> > > > >
> > > > > //
> > > > > // Just like the existing encode/decode machinery. The environment
> > > > > // provides a rich set of pre-defined encodes for primitive types
> > > > > // and containers.
> > > > > //
> > > > >
> > > > > #define DEFINE_ENC_DEC_RAW(type) \
> > > > > inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
> > > > > inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p + sizeof(type); } \
> > > > > inline const char *enc_dec(const char *p,type &o) { o = *(const type *)p; return p + sizeof(type); }
> > > > >
> > > > > DEFINE_ENC_DEC_RAW(int);
> > > > > DEFINE_ENC_DEC_RAW(size_t);
> > > > >
> > > > > //
> > > > > // String encode/decode (Yea, I know size_t isn't portable -- this
> > > > > // is an EXAMPLE man...)
> > > > > //
> > > > > inline size_t enc_dec(size_t p,string& s) { return p + sizeof(size_t) + s.size(); }
> > > > > inline char *enc_dec(char *p,string& s) { *(size_t *)p = s.size(); memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p + sizeof(size_t) + s.size(); }
> > > > > inline const char *enc_dec(const char *p,string& s) { s = string(p + sizeof(size_t),*(size_t *)p); return p + sizeof(size_t) + s.size(); }
> > > > >
> > > > > //
> > > > > // Let's do a container.
> > > > > //
> > > > > // One of the problems with a container is that making an accurate
> > > > > // estimate of the size would theoretically require that you walk
> > > > > // the entire container and add up the sizes of each element.
> > > > > // We probably don't want to do that. So here, I do a hack that
> > > > > // just assumes that I can fake up an individual element and
> > > > > // multiply that by the number of elements in the container. This
> > > > > // hack works anytime the estimate function for the contained type
> > > > > // has a fixed maximum size. BTW, this is safe: if the contained
> > > > > // type has a variable size (like set<string>) then it will fault
> > > > > // out the first time you run it.
> > > > > //
> > > > > // Naturally, something like set<string> or map<string,string> is
> > > > > // a highly desirable thing to be able to encode/decode; there's
> > > > > // no reason that you can't create an enc_dec_slow function that
> > > > > // properly computes the maximum size by walking the container.
> > > > > //
> > > > > template<typename t>
> > > > > inline size_t enc_dec(size_t p,set<t>& s) { return p + sizeof(size_t) + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }
> > > > >
> > > > > template<typename t>
> > > > > inline char *enc_dec(char *p,set<t>& s) {
> > > > >    size_t sz = s.size();
> > > > >    p = enc_dec(p,sz);
> > > > >    for (const t& e : s) {
> > > > >       p = enc_dec(p,const_cast<t&>(e));
> > > > >    }
> > > > >    return p;
> > > > > }
> > > > >
> > > > > template<typename t>
> > > > > inline const char *enc_dec(const char *p,set<t>&s) {
> > > > >    size_t sz;
> > > > >    p = enc_dec(p,sz);
> > > > >    while (sz--) {
> > > > >       t temp;
> > > > >       p = enc_dec(p,temp);
> > > > >       s.insert(temp);
> > > > >    }
> > > > >    return p;
> > > > > }
> > > > >
> > > > > //
> > > > > // Specialized encode/decode for a single data type. These are
> > > > > // invoked explicitly...
> > > > > //
> > > > > inline size_t enc_dec_lba(size_t p,int& lba) {
> > > > >    return p + sizeof(lba); // Max....
> > > > > }
> > > > >
> > > > > inline char * enc_dec_lba(char *p,int& lba) {
> > > > >    *p = 15;
> > > > >    return p + 1; // blah blah
> > > > > }
> > > > >
> > > > > inline const char *enc_dec_lba(const char *p,int& lba) {
> > > > >    lba = *p;
> > > > >    return p+1;
> > > > > }
> > > > >
> > > > > //
> > > > > // Specialized encode/decode for more sophisticated things than primitives.
> > > > > //
> > > > > // Here's an example of an encoder/decoder for a pair of fields
> > > > > //
> > > > > inline size_t enc_dec_range(size_t p,short& start,short& end) {
> > > > >    return p + 2 * sizeof(short);
> > > > > }
> > > > >
> > > > > inline char *enc_dec_range(char *p, short& start, short& end) {
> > > > >    short *s = (short *) p;
> > > > >    s[0] = start;
> > > > >    s[1] = end;
> > > > >    return p + sizeof(short) * 2;
> > > > > }
> > > > >
> > > > > inline const char *enc_dec_range(const char *p,short& start, short& end) {
> > > > >    start = *(short *)p;
> > > > >    end   = *(short *)(p + sizeof(short));
> > > > >    return p + 2*sizeof(short);
> > > > > }
> > > > >
> > > > >
> > > > > //
> > > > > // Some C++ template wizardry to make the single encode/decode
> > > > > // function possible.
> > > > > //
> > > > > enum SERIAL_TYPE {
> > > > >    ESTIMATE,
> > > > >    ENCODE,
> > > > >    DECODE
> > > > > };
> > > > >
> > > > > template <enum SERIAL_TYPE s> struct serial_type;
> > > > >
> > > > > template<> struct serial_type<ESTIMATE> { typedef size_t type; };
> > > > > template<> struct serial_type<ENCODE>   { typedef char * type; };
> > > > > template<> struct serial_type<DECODE>   { typedef const char *type; };
> > > > >
> > > > > //
> > > > > // This macro is the key; it connects the external non-member
> > > > > // function to the correct member function.
> > > > > //
> > > > > #define DEFINE_STRUCT_ENC_DEC(s) \
> > > > > inline size_t      enc_dec(size_t p, s &o)     { return o.enc_dec<ESTIMATE>(p); } \
> > > > > inline char *      enc_dec(char *p, s &o)      { return o.enc_dec<ENCODE>(p); } \
> > > > > inline const char *enc_dec(const char *p,s &o) { return o.enc_dec<DECODE>(p); }
> > > > >
> > > > > //
> > > > > // Our example structure
> > > > > //
> > > > > struct astruct {
> > > > >    int a;
> > > > >    set<int> b;
> > > > >    int lba;
> > > > >    short start,end;
> > > > >
> > > > >    //
> > > > >    // <<<<< You need to provide this function just once.
> > > > >    //
> > > > >    template<enum SERIAL_TYPE s> typename serial_type<s>::type
> > > > >    enc_dec(typename serial_type<s>::type p) {
> > > > >       p = ::enc_dec(p,a);
> > > > >       p = ::enc_dec(p,b);
> > > > >       p = ::enc_dec_lba(p,lba);
> > > > >       p = ::enc_dec_range(p,start,end);
> > > > >       return p;
> > > > >    }
> > > > > };
> > > > >
> > > > > //
> > > > > // This macro connects the global enc_dec to the member function.
> > > > > // One of these per struct declaration.
> > > > > //
> > > > > DEFINE_STRUCT_ENC_DEC(astruct);
> > > > >
> > > > >
> > > > > //
> > > > > // Here's a simple test program. The real encode/decode framework
> > > > > // needs to be connected to bufferlist using the pseudo-code that
> > > > > // I documented in my previous email.
> > > > > //
> > > > >
> > > > > int main(int argc,char **argv) {
> > > > >
> > > > >    astruct a;
> > > > >    a.a = 10;
> > > > >    a.b.insert(2);
> > > > >    a.b.insert(3);
> > > > >    a.lba = 12;
> > > > >
> > > > >    size_t s = a.enc_dec<ESTIMATE>(size_t(0));
> > > > >    cout << "Estimated size is " << s << "\n";
> > > > >
> > > > >    char buffer[100];
> > > > >
> > > > >    char *end = a.enc_dec<ENCODE>(buffer);
> > > > >
> > > > >    cout << "Actual storage was " << end-buffer << "\n";
> > > > >
> > > > >    astruct b;
> > > > >
> > > > >    (void) b.enc_dec<DECODE>(buffer); // decode it
> > > > >
> > > > >    cout << "A.a = " << b.a << "\n";
> > > > >    for (auto e : b.b) {
> > > > >       cout << " " << e;
> > > > >    }
> > > > >
> > > > >    cout << "\n";
> > > > >
> > > > >    cout << "a.lba = " << b.lba << "\n";
> > > > >
> > > > >    return 0;
> > > > > }
> > > > >
> > > > >
> > > > > Allen Samuels
> > > > > SanDisk |a Western Digital brand
> > > > > 2880 Junction Avenue, San Jose, CA 95134
> > > > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx
> > > > >
> > > > >
> > > > >> -----Original Message-----
> > > > >> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> > > > >> Sent: Tuesday, July 12, 2016 8:13 PM
> > > > >> To: Sage Weil <sweil@xxxxxxxxxx>; Allen Samuels
> > > > >> <Allen.Samuels@xxxxxxxxxxx>
> > > > >> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> > > > >> Subject: Re: bluestore onode diet and encoding overhead
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 07/12/2016 08:50 PM, Sage Weil wrote:
> > > > >>> On Tue, 12 Jul 2016, Allen Samuels wrote:
> > > > >>>> Good analysis.
> > > > >>>>
> > > > >>>> My original comments about putting the oNode on a diet included
> > > > >>>> the idea of a "custom" encode/decode path for certain high-usage cases.
> > > > >>>> At the time, Sage resisted going down that path hoping that a
> > > > >>>> more optimized generic case would get the job done. Your
> > > > >>>> analysis shows that while we've achieved significant space
> > > > >>>> reduction this has come at the expense of CPU time -- which
> > > > >>>> dominates small object performance (I suspect that eventually
> > > > >>>> we'd discover that the variable length decode path would be
> > > > >>>> responsible for a substantial read performance degradation also
> > > > >>>> -- which may or may not be part of the read performance
> > > > >>>> drop-off that you're seeing). This isn't a surprising
> > > > >>>> result, though it is unfortunate.
> > > > >>>>
> > > > >>>> I believe we need to revisit the idea of custom encode/decode
> > > > >>>> paths for high-usage cases, only now the gains need to be
> > > > >>>> focused on CPU utilization as well as space efficiency.
> > > > >>>
> > > > >>> I still think we can get most or all of the way there in a
> > > > >>> generic way by revising the way that we interact with bufferlist
> > > > >>> for encode and decode.
> > > > >>> We haven't actually tried to optimize this yet, and the current
> > > > >>> code is pretty horribly inefficient (asserts all over the place,
> > > > >>> and many layers of pointer indirection to do a simple append).
> > > > >>> I think we need to do two
> > > > >>> things:
> > > > >>>
> > > > >>> 1) decode path: optimize the iterator class so that it has a
> > > > >>> const char *current and const char *current_end that point into
> > > > >>> the current buffer::ptr.  This way any decode will have a single
> > > > >>> pointer
> > > > >>> add+comparison to ensure there is enough data to copy before
> > > > >>> falling into the slow path (partial buffer, move to next buffer, etc.).
> > > > >>>
> > > > >>
> > > > >> I don't have a good sense yet for how much this is hurting us in
> > > > >> the read path.  We screwed something up in the last couple of
> > > > >> weeks and small reads are quite slow.
> > > > >>
> > > > >>> 2) Having that comparison is still not ideal, but we should
> > > > >>> consider ways to get around that too.  For example, if we know
> > > > >>> that we are going to decode N M-byte things, we could do an
> > > > >>> iterator 'reserve' or 'check' that ensures we have a valid
> > > > >>> pointer for that much and then proceed without checks.  The
> > > > >>> interface here would be tricky, though, since in the slow case
> > > > >>> we'll span buffers and need to magically fall back to a
> > > > >>> different decode path (hard to maintain) or do a temporary copy
> > > > >>> (probably faster but we need to ensure the iterator owns it and
> > > > >>> frees it later).  I'd say this is step 2 and optional; step 1
> > > > >>> will have the most benefit.
> > > > >>>
> > > > >>> 3) encode path: currently all encode methods take a bufferlist&
> > > > >>> and the bufferlist itself as an append buffer.  I think this is
> > > > >>> flawed and limiting.  Instead, we should make a new class called
> > > > >>> buffer::list::appender (or similar) and templatize the encode
> > > > >>> methods so they can take a safe_appender (which does bounds
> > > > >>> checking) or an unsafe_appender (which does not).  For the
> > > > >>> latter, the user takes responsibility for making sure there is
> > > > >>> enough space by doing a
> > > > >>> reserve() type call which returns an unsafe_appender, and it's
> > > > >>> their job to make sure they don't shove too much data into it.
> > > > >>> That should make the encode path a memcpy + ptr increment (for
> > > > >>> savvy/optimized callers).
> > > > >>
> > > > >> Seems reasonable and similar in performance to what Piotr and I
> > > > >> were discussing this morning.  As a very simple test I was
> > > > >> thinking of doing a quick size computation and then passing that
> > > > >> in to increase the append_buffer size when the bufferlist is
> > > > >> created in Bluestore::_txc_write_nodes.  His idea went a bit
> > > > >> farther: break the encapsulation, compute the fully encoded
> > > > >> message, and dump it directly into a buffer of a computed size
> > > > >> without the extra assert checks or bounds checking.  Obviously
> > > > >> his idea would be faster but more work.
> > > > >>
> > > > >> It sounds like your solution would be similar but a bit more formalized.
> > > > >>
> > > > >>>
> > > > >>> I suggest we use bluestore as a test case to make the interfaces
> > > > >>> work and be fast.  If we succeed we can take advantage of it
> > > > >>> across the rest of the code base as well.
> > > > >>
> > > > >> Do we have other places in the code with similar byte append
> > > > >> behavior?  That's what's really killing us I think, especially
> > > > >> with how small the new append_buffer is when you run out of
> > > > >> space when appending bytes.
> > > > >>
> > > > >>>
> > > > >>> That's my thinking, at least.  I haven't had time to prototype
> > > > >>> it out yet, but I think our goal should be to make the
> > > > >>> encode/decode paths capable of being a memcpy + ptr addition in
> > > > >>> the fast path, and let that guide the interface...
> > > > >>>
> > > > >>> sage
> > > > >>> --
> > > > >>>
> > >
> > >
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


