As promised, here's some code that hacks out a new encode/decode framework. That has the advantage of only having to list the fields of a struct once and is pretty much guaranteed to never overrun a buffer.... Comments are requested :) #include <iostream> #include <fstream> #include <set> #include <string> #include <string.h> /******************************************************* New fast encode/decode framework. The entire framework is built around the idea that each object has three operations: ESTIMATE -- worst-case estimate of the amount of storage required for this object ENCODE -- encode object into buffer of size ESTIMATE DECODE -- encode object from buffer of size actual. Each object has a single templated function that actually provides all three operations in a single set of code. But doing this, it's pretty much guaranteed that the ESTIMATE and the ENCODE code are in harmony (i.e. that the estimate is correct) it also saves a lot of typing/reading... Generally, all three operations are provided on a single function name with the input and return parameters overloaded to distinguish them. It's observed that for each of the three operations there is a single value which needs to be transmitted between each of the micro-encode/decode calls Yes, this is confusing, but let's look at a simple example struct simple { int a; float b; string c; set<int> d; }; To encode this struct we generate a function that does the micro-encoding of each of the fields of the struct Here's an example of a function that does the ESTIMATE operation. size_t simple::estimate() { return sizeof(a) + sizeof(b) + c.size() + d.size() * sizeof(int); } We're going to re-write it as: size_t simple::estimate(size_t p) { p = estimate(p,a); p = estimate(p,b); p = estimate(p,c); p = estimate(p,d); return p; } assuming that the sorta function: template<typename t> size_t estimate(size_t p,t& o) { return p + sizeof(o); } template<typename t> size_t estimate(size_t p,set<t>& o) { return p + o.size() * sizeof(t); } similarly, the encode operation is represented as: char * simple::encode(char *p) { p = encode(p,a); p = encode(p,b); p = encode(p,c); p = encode(p,d); return p; } similarly, the decode operation is represented as: const char * simple::decode(const char *p) { p = decode(p,a); p = decode(p,b); p = decode(p,c); p = decode(p,d); return p; } You can now see that it's possible to create a single function that does all three operations in a single block of code, provided that you can fiddle the input/output parameter types appropriately. In essence the pattern is p = enc_dec(p,struct_field_1); p = enc_dec(p,struct_field_2); p = enc_dec(p,struct_field_3); With the type of p being set differently for each operation, i.e., for ESTIMATE, p = size_t for ENCODE, p = char * for DECODE, p = const char * This is the essence of how the encode/decode framework operates. Though there is some more sophistication... ---------------------- We also want to allow the encode/decode machinery to be per-type and to operate *****************************************************************************/ using namespace std; // // Just like the existing encode/decode machinery. The environment provides a rich set of // pre-defined encodes for primitive types and containers // #define DEFINE_ENC_DEC_RAW(type) \ inline size_t enc_dec(size_t p,type &o) { return p + sizeof(type); } \ inline char * enc_dec(char *p, type &o) { *(type *)p = o; return p + sizeof(type); } \ inline const char *enc_dec(const char *p,type &o) { o = *(const type *)p; return p + sizeof(type); } DEFINE_ENC_DEC_RAW(int); DEFINE_ENC_DEC_RAW(size_t); // // String encode/decode (Yea, I know size_t isn't portable -- this is an EXAMPLE man...) // inline size_t enc_dec(size_t p,string& s) { return p + sizeof(size_t) + s.size(); } inline char * enc_dec(char * p,string& s) { *(size_t *)p = s.size(); memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p + sizeof(size_t) + s.size(); } inline const char *enc_dec(const char *p,string& s) { s = string(p + sizeof(size_t),*(size_t *)p); return p + sizeof(size_t) + s.size(); } // // Let's do a container. // // One of the problems with a container is that making an accurate estimate of the size // would theoretically require that you walk the entire container and add up the sizes of each element. // We probably don't want to do that. So here, I do a hack that just assumes that I can fake up a individual element // and multiple that by the number of elements in a container. This hack works anytime that the estimate function // for the contained type has a fixed maximum size. BTW, this is safe, if the contained type has a variable size // (like set<string>) then it will fault out the first time you run it. // // Naturally, something like set<string> or map<string,string> is a highly desirable thing to be able to encode/decode // there's no reason that you can't create a enc_dec_slow function that properly computes the maximum size by walking the container. // template<typename t> inline size_t enc_dec(size_t p,set<t>& s) { return p + sizeof(size_t) + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); } template<typename t> inline char *enc_dec(char *p,set<t>& s) { size_t sz = s.size(); p = enc_dec(p,sz); for (const t& e : s) { p = enc_dec(p,const_cast<t&>(e)); } return p; } template<typename t> inline const char *enc_dec(const char *p,set<t>&s) { size_t sz; p = enc_dec(p,sz); while (sz--) { t temp; p = enc_dec(p,temp); s.insert(temp); } return p; } // // Specialized encode/decode for a single data type. These are invoked explicitly... // inline size_t enc_dec_lba(size_t p,int& lba) { return p + sizeof(lba); // Max.... } inline char * enc_dec_lba(char *p,int& lba) { *p = 15; return p + 1; // blah blah } inline const char *enc_dec_lba(const char *p,int& lba) { lba = *p; return p+1; } // // Specialized encode/decode for more sophisticated things primitives. // // Here's an example of a encode/decoder for a pair of fields // inline size_t enc_dec_range(size_t p,short& start,short& end) { return p + 2 * sizeof(short); } inline char *enc_dec_range(char *p, short& start, short& end) { short *s = (short *) p; s[0] = start; s[1] = end; return p + sizeof(short) * 2; } inline const char *enc_dec_range(const char *p,short& start, short& end) { start = *(short *)p; end = *(short *)(p + sizeof(short)); return p + 2*sizeof(short); } // // Some C++ template wizardry to make the single encode/decode function possible. // enum SERIAL_TYPE { ESTIMATE, ENCODE, DECODE }; template <enum SERIAL_TYPE s> struct serial_type; template<> struct serial_type<ESTIMATE> { typedef size_t type; }; template<> struct serial_type<ENCODE> { typedef char * type; }; template<> struct serial_type<DECODE> { typedef const char *type; }; // // This macro is the key, it connects the external non-member function to the correct member function. // #define DEFINE_STRUCT_ENC_DEC(s) \ inline size_t enc_dec(size_t p, s &o) { return o.enc_dec<ESTIMATE>(p); } \ inline char * enc_dec(char *p , s &o) { return o.enc_dec<ENCODE>(p); } \ inline const char *enc_dec(const char *p,s &o) { return o.enc_dec<DECODE>(p); } // // Our example structure // struct astruct { int a; set<int> b; int lba; short start,end; // // <<<<< You need to provide this function just one. // template<enum SERIAL_TYPE s> typename serial_type<s>::type enc_dec(typename serial_type<s>::type p) { p = ::enc_dec(p,a); p = ::enc_dec(p,b); p = ::enc_dec_lba(p,lba); p = ::enc_dec_range(p,start,end); return p; } }; // // This macro connects the global enc_dec to the member function. // One of these per struct declaration // DEFINE_STRUCT_ENC_DEC(astruct); // // Here's a simple test program. The real encode/decode framework needs to be connected to bufferlist using the pseudo-code // that I documented in my previous email. // int main(int argc,char **argv) { astruct a; a.a = 10; a.b.insert(2); a.b.insert(3); a.lba = 12; size_t s = a.enc_dec<ESTIMATE>(size_t(0)); cout << "Estimated size is " << s << "\n"; char buffer[100]; char *end = a.enc_dec<ENCODE>(buffer); cout << "Actual storage was " << end-buffer << "\n"; astruct b; (void) b.enc_dec<DECODE>(buffer); // decode it cout << "A.a = " << b.a << "\n"; for (auto e : b.b) { cout << " " << e; } cout << "\n"; cout << "a.lba = " << b.lba << "\n"; return 0; } Allen Samuels SanDisk |a Western Digital brand 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > -----Original Message----- > From: Mark Nelson [mailto:mnelson@xxxxxxxxxx] > Sent: Tuesday, July 12, 2016 8:13 PM > To: Sage Weil <sweil@xxxxxxxxxx>; Allen Samuels > <Allen.Samuels@xxxxxxxxxxx> > Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx> > Subject: Re: bluestore onode diet and encoding overhead > > > > On 07/12/2016 08:50 PM, Sage Weil wrote: > > On Tue, 12 Jul 2016, Allen Samuels wrote: > >> Good analysis. > >> > >> My original comments about putting the oNode on a diet included the > >> idea of a "custom" encode/decode path for certain high-usage cases. > >> At the time, Sage resisted going down that path hoping that a more > >> optimized generic case would get the job done. Your analysis shows > >> that while we've achieved significant space reduction this has come > >> at the expense of CPU time -- which dominates small object > >> performance (I suspect that eventually we'd discover that the > >> variable length decode path would be responsible for a substantial > >> read performance degradation also -- which may or may not be part of > >> the read performance drop-off that you're seeing). This isn't a surprising > result, though it is unfortunate. > >> > >> I believe we need to revisit the idea of custom encode/decode paths > >> for high-usage cases, only now the gains need to be focused on CPU > >> utilization as well as space efficiency. > > > > I still think we can get most or all of the way there in a generic way > > by revising the way that we interact with bufferlist for encode and decode. > > We haven't actually tried to optimize this yet, and the current code > > is pretty horribly inefficient (asserts all over the place, and many > > layers of pointer indirection to do a simple append). I think we need > > to do two > > things: > > > > 1) decode path: optimize the iterator class so that it has a const > > char *current and const char *current_end that point into the current > > buffer::ptr. This way any decode will have a single pointer > > add+comparison to ensure there is enough data to copy before falling > > add+into > > the slow path (partial buffer, move to next buffer, etc.). > > > > I don't have a good sense yet for how much this is hurting us in the read > path. We screwed something up in the last couple of weeks and small reads > are quite slow. > > > 2) Having that comparison is still not ideal, but we shoudl consider > > ways to get around that too. For example, if we know that we are > > going to decode N M-byte things, we could do an iterator 'reserve' or > > 'check' that ensures we have a valid pointer for that much and then > > proceed without checks. The interface here would be tricky, though, > > since in the slow case we'll span buffers and need to magically fall > > back to a different decode path (hard to maintain) or do a temporary > > copy (probably faster but we need to ensure the iterator owns it and > > frees is later). I'd say this is step 2 and optional; step 1 will have the most > benefit. > > > > 3) encode path: currently all encode methods take a bufferlist& and > > the bufferlist itself as an append buffer. I think this is flawed and > > limiting. Instead, we should make a new class called > > buffer::list::appender (or similar) and templatize the encode methods > > so they can take a safe_appender (which does bounds checking) or an > > unsafe_appender (which does not). For the latter, the user takes > > responsibility for making sure there is enough space by doing a > > reserve() type call which returns an unsafe_appender, and it's their > > job to make sure they don't shove too much data into it. That should > > make the encode path a memcpy + ptr increment (for savvy/optimized > callers). > > Seems reasonable and similar in performance to what Piotr and I were > discussing this morning. As a very simple test I was thinking of doing a quick > size computation and then passing that in to increase the append_buffer size > when the bufferlist is created in Bluestore::_txc_write_nodes. His idea went > a bit farther to break the encapsulation, compute the fully encoded > message, and dump it directly into a buffer of a computed size without the > extra assert checks or bounds checking. Obviously his idea would be faster > but more work. > > It sounds like your solution would be similar but a bit more formalized. > > > > > I suggest we use bluestore as a test case to make the interfaces work > > and be fast. If we succeed we can take advantage of it across the > > reset of the code base as well. > > Do we have other places in the code with similar byte append behavior? > That's what's really killing us I think, especially with how small the new > append_buffer is when you run out of space when appending bytes. > > > > > That's my thinking, at least. I haven't had time to prototype it out > > yet, but I think our goal should be to make the encode/decode paths > > capable of being a memcpy + ptr addition in the fast path, and let > > that guide the interface... > > > > sage > > -- > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" > > in the body of a message to majordomo@xxxxxxxxxxxxxxx More > majordomo > > info at http://vger.kernel.org/majordomo-info.html > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html