RE: bluestore onode diet and encoding overhead

Allen Samuels <Allen.Samuels@xxxxxxxxxxx> · Thu, 14 Jul 2016 05:52:15 +0000

As promised, here's some code that hacks out a new encode/decode framework. That has the advantage of only having to list the fields of a struct once and is pretty much guaranteed to never overrun a buffer....

Comments are requested :)

#include <iostream>
#include <fstream>
#include <set>
#include <string>
#include <string.h>

/*******************************************************

   New fast encode/decode framework.

   The entire framework is built around the idea that each object has three operations:

     ESTIMATE  -- worst-case estimate of the amount of storage required for this object
     ENCODE    -- encode object into buffer of size ESTIMATE
     DECODE    -- encode object from buffer of size actual.

   Each object has a single templated function that actually provides all three operations in a single set of code.
   But doing this, it's pretty much guaranteed that the ESTIMATE and the ENCODE code are in harmony (i.e. that the estimate is correct)
   it also saves a lot of typing/reading...

   Generally, all three operations are provided on a single function name with the input and return parameters overloaded to distinguish them.

   It's observed that for each of the three operations there is a single value which needs to be transmitted between each of the micro-encode/decode calls
   Yes, this is confusing, but let's look at a simple example

    struct simple {
      int a;
      float b;
      string c;
      set<int> d;
    };

    To encode this struct we generate a function that does the micro-encoding of each of the fields of the struct
    Here's an example of a function that does the ESTIMATE operation.

    size_t simple::estimate() {
       return 
          sizeof(a) +
          sizeof(b) +
          c.size() +
          d.size() * sizeof(int);
    }

    We're going to re-write it as:

    size_t simple::estimate(size_t p) {
       p = estimate(p,a);
       p = estimate(p,b);
       p = estimate(p,c);
       p = estimate(p,d);
       return p;
    }

    assuming that the sorta function:

    template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); }
    template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); }

    similarly, the encode operation is represented as:

    char * simple::encode(char *p) {
       p = encode(p,a);
       p = encode(p,b);
       p = encode(p,c);
       p = encode(p,d);
       return p;
    }

    similarly, the decode operation is represented as:

    const char * simple::decode(const char *p) {
       p = decode(p,a);
       p = decode(p,b);
       p = decode(p,c);
       p = decode(p,d);
       return p;
    }

You can now see that it's possible to create a single function that does all three operations in a single block
of code, provided that you can fiddle the input/output parameter types appropriately.

In essence the pattern is

    p = enc_dec(p,struct_field_1);
    p = enc_dec(p,struct_field_2);
    p = enc_dec(p,struct_field_3);

With the type of p being set differently for each operation, i.e.,
    for ESTIMATE, p = size_t
    for ENCODE,   p = char *
    for DECODE,   p = const char *

This is the essence of how the encode/decode framework operates. Though there is some more sophistication...

----------------------

We also want to allow the encode/decode machinery to be per-type and to operate 

*****************************************************************************/

using namespace std;

//
// Just like the existing encode/decode machinery. The environment provides a rich set of 
// pre-defined encodes for primitive types and containers
//

#define DEFINE_ENC_DEC_RAW(type) \
inline size_t      enc_dec(size_t p,type &o)      { return p + sizeof(type); } \
inline char *      enc_dec(char *p, type &o)      { *(type *)p = o; return p + sizeof(type); } \
inline const char *enc_dec(const char *p,type &o) { o = *(const type *)p; return p + sizeof(type); }

DEFINE_ENC_DEC_RAW(int);
DEFINE_ENC_DEC_RAW(size_t);

//
// String encode/decode (Yea, I know size_t isn't portable -- this is an EXAMPLE man...)
//
inline size_t enc_dec(size_t p,string& s) { return p + sizeof(size_t) + s.size(); }
inline char * enc_dec(char * p,string& s) { *(size_t *)p = s.size(); memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p + sizeof(size_t) + s.size(); }
inline const char *enc_dec(const char *p,string& s) { s = string(p + sizeof(size_t),*(size_t *)p); return p + sizeof(size_t) + s.size(); }

//
// Let's do a container.
//
// One of the problems with a container is that making an accurate estimate of the size
// would theoretically require that you walk the entire container and add up the sizes of each element.
// We probably don't want to do that. So here, I do a hack that just assumes that I can fake up a individual element
// and multiple that by the number of elements in a container. This hack works anytime that the estimate function
// for the contained type has a fixed maximum size. BTW, this is safe, if the contained type has a variable size
//  (like set<string>) then it will fault out the first time you run it.
//
// Naturally, something like set<string> or map<string,string> is a highly desirable thing to be able to encode/decode
// there's no reason that you can't create a enc_dec_slow function that properly computes the maximum size by walking the container.
//
template<typename t>
inline size_t enc_dec(size_t p,set<t>& s) { return p + sizeof(size_t) + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); }

template<typename t>
inline char *enc_dec(char *p,set<t>& s) {
   size_t sz = s.size();
   p = enc_dec(p,sz);
   for (const t& e : s) {
      p = enc_dec(p,const_cast<t&>(e));
   }
   return p;
}

template<typename t>
inline const char *enc_dec(const char *p,set<t>&s) {
   size_t sz;
   p = enc_dec(p,sz);
   while (sz--) {
      t temp;
      p = enc_dec(p,temp);
      s.insert(temp);
   }
   return p;
}

//
// Specialized encode/decode for a single data type. These are invoked explicitly...
//
inline size_t enc_dec_lba(size_t p,int& lba) {
   return p + sizeof(lba); // Max....
}

inline char * enc_dec_lba(char *p,int& lba) {
   *p = 15;
   return p + 1; // blah blah
}

inline const char *enc_dec_lba(const char *p,int& lba) {
   lba = *p;
   return p+1;
}

//
// Specialized encode/decode for more sophisticated things primitives.
//
// Here's an example of a encode/decoder for a pair of fields
//
inline size_t enc_dec_range(size_t p,short& start,short& end) {
   return p + 2 * sizeof(short);
}

inline char *enc_dec_range(char *p, short& start, short& end) {
   short *s = (short *) p;
   s[0] = start;
   s[1] = end;
   return p + sizeof(short) * 2;
}

inline const char *enc_dec_range(const char *p,short& start, short& end) {
   start = *(short *)p;
   end   = *(short *)(p + sizeof(short));
   return p + 2*sizeof(short);
}

//
// Some C++ template wizardry to make the single encode/decode function possible.
//
enum SERIAL_TYPE {
   ESTIMATE,
   ENCODE,
   DECODE
};

template <enum SERIAL_TYPE s> struct serial_type;

template<> struct serial_type<ESTIMATE> { typedef size_t type; };
template<> struct serial_type<ENCODE>   { typedef char * type; };
template<> struct serial_type<DECODE>   { typedef const char *type; };

//
// This macro is the key, it connects the external non-member function to the correct member function.
//
#define DEFINE_STRUCT_ENC_DEC(s) \
inline size_t      enc_dec(size_t p, s &o) { return o.enc_dec<ESTIMATE>(p); } \
inline char *      enc_dec(char *p , s &o)  { return o.enc_dec<ENCODE>(p); } \
inline const char *enc_dec(const char *p,s &o)  { return o.enc_dec<DECODE>(p); }

//
// Our example structure
//
struct astruct {
   int a;
   set<int> b;
   int lba;
   short start,end;

   //
   // <<<<< You need to provide this function just one.
   //
   template<enum SERIAL_TYPE s> typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) {
      p = ::enc_dec(p,a);
      p = ::enc_dec(p,b);
      p = ::enc_dec_lba(p,lba);
      p = ::enc_dec_range(p,start,end);
      return p;
   }
};

//
// This macro connects the global enc_dec to the member function.
// One of these per struct declaration
//
DEFINE_STRUCT_ENC_DEC(astruct);

//
// Here's a simple test program. The real encode/decode framework needs to be connected to bufferlist using the pseudo-code
// that I documented in my previous email.
//

int main(int argc,char **argv) {

   astruct a;
   a.a = 10;
   a.b.insert(2);
   a.b.insert(3);
   a.lba = 12;

   size_t s = a.enc_dec<ESTIMATE>(size_t(0));
   cout << "Estimated size is " << s << "\n";

   char buffer[100];

   char *end = a.enc_dec<ENCODE>(buffer);

   cout << "Actual storage was " << end-buffer << "\n";

   astruct b;

   (void) b.enc_dec<DECODE>(buffer); // decode it

   cout << "A.a = " << b.a << "\n";
   for (auto e : b.b) {
      cout << " " << e;
   }

   cout << "\n";

   cout << "a.lba = " << b.lba << "\n";

   return 0;
}

Allen Samuels
SanDisk |a Western Digital brand
2880 Junction Avenue, San Jose, CA 95134
T: +1 408 801 7030| M: +1 408 780 6416
allen.samuels@xxxxxxxxxxx

> -----Original Message-----
> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx]
> Sent: Tuesday, July 12, 2016 8:13 PM
> To: Sage Weil <sweil@xxxxxxxxxx>; Allen Samuels
> <Allen.Samuels@xxxxxxxxxxx>
> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx>
> Subject: Re: bluestore onode diet and encoding overhead
> 
> 
> 
> On 07/12/2016 08:50 PM, Sage Weil wrote:
> > On Tue, 12 Jul 2016, Allen Samuels wrote:
> >> Good analysis.
> >>
> >> My original comments about putting the oNode on a diet included the
> >> idea of a "custom" encode/decode path for certain high-usage cases.
> >> At the time, Sage resisted going down that path hoping that a more
> >> optimized generic case would get the job done. Your analysis shows
> >> that while we've achieved significant space reduction this has come
> >> at the expense of CPU time -- which dominates small object
> >> performance (I suspect that eventually we'd discover that the
> >> variable length decode path would be responsible for a substantial
> >> read performance degradation also -- which may or may not be part of
> >> the read performance drop-off that you're seeing). This isn't a surprising
> result, though it is unfortunate.
> >>
> >> I believe we need to revisit the idea of custom encode/decode paths
> >> for high-usage cases, only now the gains need to be focused on CPU
> >> utilization as well as space efficiency.
> >
> > I still think we can get most or all of the way there in a generic way
> > by revising the way that we interact with bufferlist for encode and decode.
> > We haven't actually tried to optimize this yet, and the current code
> > is pretty horribly inefficient (asserts all over the place, and many
> > layers of pointer indirection to do a simple append).  I think we need
> > to do two
> > things:
> >
> > 1) decode path: optimize the iterator class so that it has a const
> > char *current and const char *current_end that point into the current
> > buffer::ptr.  This way any decode will have a single pointer
> > add+comparison to ensure there is enough data to copy before falling
> > add+into
> > the slow path (partial buffer, move to next buffer, etc.).
> >
> 
> I don't have a good sense yet for how much this is hurting us in the read
> path.  We screwed something up in the last couple of weeks and small reads
> are quite slow.
> 
> > 2) Having that comparison is still not ideal, but we shoudl consider
> > ways to get around that too.  For example, if we know that we are
> > going to decode N M-byte things, we could do an iterator 'reserve' or
> > 'check' that ensures we have a valid pointer for that much and then
> > proceed without checks.  The interface here would be tricky, though,
> > since in the slow case we'll span buffers and need to magically fall
> > back to a different decode path (hard to maintain) or do a temporary
> > copy (probably faster but we need to ensure the iterator owns it and
> > frees is later).  I'd say this is step 2 and optional; step 1 will have the most
> benefit.
> >
> > 3) encode path: currently all encode methods take a bufferlist& and
> > the bufferlist itself as an append buffer.  I think this is flawed and
> > limiting.  Instead, we should make a new class called
> > buffer::list::appender (or similar) and templatize the encode methods
> > so they can take a safe_appender (which does bounds checking) or an
> > unsafe_appender (which does not).  For the latter, the user takes
> > responsibility for making sure there is enough space by doing a
> > reserve() type call which returns an unsafe_appender, and it's their
> > job to make sure they don't shove too much data into it.  That should
> > make the encode path a memcpy + ptr increment (for savvy/optimized
> callers).
> 
> Seems reasonable and similar in performance to what Piotr and I were
> discussing this morning.  As a very simple test I was thinking of doing a quick
> size computation and then passing that in to increase the append_buffer size
> when the bufferlist is created in Bluestore::_txc_write_nodes.  His idea went
> a bit farther to break the encapsulation, compute the fully encoded
> message, and dump it directly into a buffer of a computed size without the
> extra assert checks or bounds checking.  Obviously his idea would be faster
> but more work.
> 
> It sounds like your solution would be similar but a bit more formalized.
> 
> >
> > I suggest we use bluestore as a test case to make the interfaces work
> > and be fast.  If we succeed we can take advantage of it across the
> > reset of the code base as well.
> 
> Do we have other places in the code with similar byte append behavior?
> That's what's really killing us I think, especially with how small the new
> append_buffer is when you run out of space when appending bytes.
> 
> >
> > That's my thinking, at least.  I haven't had time to prototype it out
> > yet, but I think our goal should be to make the encode/decode paths
> > capable of being a memcpy + ptr addition in the fast path, and let
> > that guide the interface...
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > in the body of a message to majordomo@xxxxxxxxxxxxxxx More
> majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
> >
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html