Ideally, yes. Either it happens tomorrow or two weeks from now. Allen Samuels SanDisk |a Western Digital brand 2880 Junction Avenue, San Jose, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > -----Original Message----- > From: Mark Nelson [mailto:mnelson@xxxxxxxxxx] > Sent: Thursday, July 14, 2016 9:32 AM > To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil > <sweil@xxxxxxxxxx> > Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx> > Subject: Re: bluestore onode diet and encoding overhead > > So right now I'm knee deep in bisecting bluestore to track down our read > regression, but if you want to throw together a PR that uses this for encode > in bluestore I'd be certainly happy to give it a whirl on our test cluster. > > Mark > > On 07/14/2016 11:20 AM, Allen Samuels wrote: > > BTW, I see this stuff as gradually replacing the existing encode/decode > infrastructure. It's pretty easy to have them side-by-side as well as have the > new infrastructure be wire-compatible with the current stuff. That'll allow a > slow conversion from the old-style to the new-style. The only that's really > different between the two (on the wire) is that I proposed the new stuff to > have a length prefix so that the decoder knows how much data to > "straighten" before launching the fast decode (this is the equivalent of the > ESTIMATE phase during encode). For old-style stuff that doesn't have the > prefix, you'll have to "straighten" the entire remainder of the buffer -- this > may limit the rate of conversion (in that you can only afford to covert code to > the new style when you know that the overhead of straightening is > affordable -- probably because you know that there's not much data present > OR you assume that you're in a temporary transient environment. > > > > > > Allen Samuels > > SanDisk |a Western Digital brand > > 2880 Junction Avenue, San Jose, CA 95134 > > T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > > > >> -----Original Message----- > >> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx] > >> Sent: Thursday, July 14, 2016 4:16 AM > >> To: Allen Samuels <Allen.Samuels@xxxxxxxxxxx>; Sage Weil > >> <sweil@xxxxxxxxxx> > >> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx> > >> Subject: Re: bluestore onode diet and encoding overhead > >> > >> On 07/14/2016 12:52 AM, Allen Samuels wrote: > >>> As promised, here's some code that hacks out a new encode/decode > >> framework. That has the advantage of only having to list the fields > >> of a struct once and is pretty much guaranteed to never overrun a > buffer.... > >>> > >>> Comments are requested :) > >> > >> It compiles! :D > >> > >> I looked over the code, but I want to look it over again after I've > >> had my coffee since I'm still shaking the cobwebs out. Would the > >> idea here be that if you are doing varint encoding for example that > >> you always allocate the buffer based on ESTIMATE (also taking into > >> account the encoding overhead), but typically expect a much smaller > encoding? > >> > >> As it is, it's very clever. > >> > >> Mark > >> > >>> > >>> > >>> #include <iostream> > >>> #include <fstream> > >>> #include <set> > >>> #include <string> > >>> #include <string.h> > >>> > >>> > /******************************************************* > >>> > >>> > >>> New fast encode/decode framework. > >>> > >>> The entire framework is built around the idea that each object > >>> has three > >> operations: > >>> > >>> ESTIMATE -- worst-case estimate of the amount of storage > >>> required for > >> this object > >>> ENCODE -- encode object into buffer of size ESTIMATE > >>> DECODE -- encode object from buffer of size actual. > >>> > >>> Each object has a single templated function that actually > >>> provides all three > >> operations in a single set of code. > >>> But doing this, it's pretty much guaranteed that the ESTIMATE and > >>> the > >> ENCODE code are in harmony (i.e. that the estimate is correct) > >>> it also saves a lot of typing/reading... > >>> > >>> Generally, all three operations are provided on a single function > >>> name > >> with the input and return parameters overloaded to distinguish them. > >>> > >>> It's observed that for each of the three operations there is a > >>> single value > >> which needs to be transmitted between each of the micro- > encode/decode > >> calls > >>> Yes, this is confusing, but let's look at a simple example > >>> > >>> struct simple { > >>> int a; > >>> float b; > >>> string c; > >>> set<int> d; > >>> }; > >>> > >>> To encode this struct we generate a function that does the > >>> micro- > >> encoding of each of the fields of the struct > >>> Here's an example of a function that does the ESTIMATE operation. > >>> > >>> size_t simple::estimate() { > >>> return > >>> sizeof(a) + > >>> sizeof(b) + > >>> c.size() + > >>> d.size() * sizeof(int); > >>> } > >>> > >>> We're going to re-write it as: > >>> > >>> size_t simple::estimate(size_t p) { > >>> p = estimate(p,a); > >>> p = estimate(p,b); > >>> p = estimate(p,c); > >>> p = estimate(p,d); > >>> return p; > >>> } > >>> > >>> assuming that the sorta function: > >>> > >>> template<typename t> size_t estimate(size_t p,t& o) { return p + > >> sizeof(o); } > >>> template<typename t> size_t estimate(size_t p,set<t>& o) { > >>> return p + o.size() * sizeof(t); } > >>> > >>> > >>> similarly, the encode operation is represented as: > >>> > >>> char * simple::encode(char *p) { > >>> p = encode(p,a); > >>> p = encode(p,b); > >>> p = encode(p,c); > >>> p = encode(p,d); > >>> return p; > >>> } > >>> > >>> similarly, the decode operation is represented as: > >>> > >>> const char * simple::decode(const char *p) { > >>> p = decode(p,a); > >>> p = decode(p,b); > >>> p = decode(p,c); > >>> p = decode(p,d); > >>> return p; > >>> } > >>> > >>> > >>> You can now see that it's possible to create a single function that > >>> does all three operations in a single block of code, provided that > >>> you can > >> fiddle the input/output parameter types appropriately. > >>> > >>> In essence the pattern is > >>> > >>> p = enc_dec(p,struct_field_1); > >>> p = enc_dec(p,struct_field_2); > >>> p = enc_dec(p,struct_field_3); > >>> > >>> With the type of p being set differently for each operation, i.e., > >>> for ESTIMATE, p = size_t > >>> for ENCODE, p = char * > >>> for DECODE, p = const char * > >>> > >>> This is the essence of how the encode/decode framework operates. > >> Though there is some more sophistication... > >>> > >>> ---------------------- > >>> > >>> We also want to allow the encode/decode machinery to be per-type and > >>> to operate > >>> > >>> > >> > ********************************************************** > >> ************ > >>> *******/ > >>> > >>> using namespace std; > >>> > >>> // > >>> // Just like the existing encode/decode machinery. The environment > >>> provides a rich set of // pre-defined encodes for primitive types > >>> and containers // > >>> > >>> #define DEFINE_ENC_DEC_RAW(type) \ > >>> inline size_t enc_dec(size_t p,type &o) { return p + sizeof(type); } \ > >>> inline char * enc_dec(char *p, type &o) { *(type *)p = o; return p + > >> sizeof(type); } \ > >>> inline const char *enc_dec(const char *p,type &o) { o = *(const type > >>> *)p; return p + sizeof(type); } > >>> > >>> DEFINE_ENC_DEC_RAW(int); > >>> DEFINE_ENC_DEC_RAW(size_t); > >>> > >>> // > >>> // String encode/decode (Yea, I know size_t isn't portable -- this > >>> is an EXAMPLE man...) // inline size_t enc_dec(size_t p,string& s) { > >>> return p + sizeof(size_t) + s.size(); } inline char * enc_dec(char * > >>> p,string& s) { *(size_t *)p = s.size(); > >>> memcpy(p+sizeof(size_t),s.c_str(),s.size()); return p + > >>> sizeof(size_t) > >>> + s.size(); } inline const char *enc_dec(const char *p,string& s) { > >>> + s > >>> = string(p + sizeof(size_t),*(size_t *)p); return p + sizeof(size_t) > >>> + s.size(); } > >>> > >>> // > >>> // Let's do a container. > >>> // > >>> // One of the problems with a container is that making an accurate > >>> estimate of the size // would theoretically require that you walk > >>> the entire > >> container and add up the sizes of each element. > >>> // We probably don't want to do that. So here, I do a hack that just > >>> assumes that I can fake up a individual element // and multiple that > >>> by the number of elements in a container. This hack works anytime > >>> that the estimate function // for the contained type has a fixed > maximum size. > >> BTW, this is safe, if the contained type has a variable size // > >> (like set<string>) then it will fault out the first time you run it. > >>> // > >>> // Naturally, something like set<string> or map<string,string> is a > >>> highly desirable thing to be able to encode/decode // there's no > >>> reason > >> that you can't create a enc_dec_slow function that properly computes > >> the maximum size by walking the container. > >>> // > >>> template<typename t> > >>> inline size_t enc_dec(size_t p,set<t>& s) { return p + > >>> sizeof(size_t) > >>> + (s.size() * ::enc_dec(size_t(0),*(t *) 0)); } > >>> > >>> template<typename t> > >>> inline char *enc_dec(char *p,set<t>& s) { > >>> size_t sz = s.size(); > >>> p = enc_dec(p,sz); > >>> for (const t& e : s) { > >>> p = enc_dec(p,const_cast<t&>(e)); > >>> } > >>> return p; > >>> } > >>> > >>> template<typename t> > >>> inline const char *enc_dec(const char *p,set<t>&s) { > >>> size_t sz; > >>> p = enc_dec(p,sz); > >>> while (sz--) { > >>> t temp; > >>> p = enc_dec(p,temp); > >>> s.insert(temp); > >>> } > >>> return p; > >>> } > >>> > >>> // > >>> // Specialized encode/decode for a single data type. These are > >>> invoked > >> explicitly... > >>> // > >>> inline size_t enc_dec_lba(size_t p,int& lba) { > >>> return p + sizeof(lba); // Max.... > >>> } > >>> > >>> inline char * enc_dec_lba(char *p,int& lba) { > >>> *p = 15; > >>> return p + 1; // blah blah > >>> } > >>> > >>> inline const char *enc_dec_lba(const char *p,int& lba) { > >>> lba = *p; > >>> return p+1; > >>> } > >>> > >>> // > >>> // Specialized encode/decode for more sophisticated things primitives. > >>> // > >>> // Here's an example of a encode/decoder for a pair of fields // > >>> inline size_t enc_dec_range(size_t p,short& start,short& end) { > >>> return p + 2 * sizeof(short); > >>> } > >>> > >>> inline char *enc_dec_range(char *p, short& start, short& end) { > >>> short *s = (short *) p; > >>> s[0] = start; > >>> s[1] = end; > >>> return p + sizeof(short) * 2; > >>> } > >>> > >>> inline const char *enc_dec_range(const char *p,short& start, short& > end) { > >>> start = *(short *)p; > >>> end = *(short *)(p + sizeof(short)); > >>> return p + 2*sizeof(short); > >>> } > >>> > >>> > >>> // > >>> // Some C++ template wizardry to make the single encode/decode > >> function possible. > >>> // > >>> enum SERIAL_TYPE { > >>> ESTIMATE, > >>> ENCODE, > >>> DECODE > >>> }; > >>> > >>> template <enum SERIAL_TYPE s> struct serial_type; > >>> > >>> template<> struct serial_type<ESTIMATE> { typedef size_t type; }; > >>> template<> struct serial_type<ENCODE> { typedef char * type; }; > >>> template<> struct serial_type<DECODE> { typedef const char *type; }; > >>> > >>> // > >>> // This macro is the key, it connects the external non-member > >>> function to > >> the correct member function. > >>> // > >>> #define DEFINE_STRUCT_ENC_DEC(s) \ > >>> inline size_t enc_dec(size_t p, s &o) { return > o.enc_dec<ESTIMATE>(p); } > >> \ > >>> inline char * enc_dec(char *p , s &o) { return > o.enc_dec<ENCODE>(p); } > >> \ > >>> inline const char *enc_dec(const char *p,s &o) { return > >>> o.enc_dec<DECODE>(p); } > >>> > >>> // > >>> // Our example structure > >>> // > >>> struct astruct { > >>> int a; > >>> set<int> b; > >>> int lba; > >>> short start,end; > >>> > >>> // > >>> // <<<<< You need to provide this function just one. > >>> // > >>> template<enum SERIAL_TYPE s> typename serial_type<s>::type > >> enc_dec(typename serial_type<s>::type p) { > >>> p = ::enc_dec(p,a); > >>> p = ::enc_dec(p,b); > >>> p = ::enc_dec_lba(p,lba); > >>> p = ::enc_dec_range(p,start,end); > >>> return p; > >>> } > >>> }; > >>> > >>> // > >>> // This macro connects the global enc_dec to the member function. > >>> // One of these per struct declaration // > >>> DEFINE_STRUCT_ENC_DEC(astruct); > >>> > >>> > >>> // > >>> // Here's a simple test program. The real encode/decode framework > >>> needs to be connected to bufferlist using the pseudo-code // that I > >> documented in my previous email. > >>> // > >>> > >>> int main(int argc,char **argv) { > >>> > >>> astruct a; > >>> a.a = 10; > >>> a.b.insert(2); > >>> a.b.insert(3); > >>> a.lba = 12; > >>> > >>> size_t s = a.enc_dec<ESTIMATE>(size_t(0)); > >>> cout << "Estimated size is " << s << "\n"; > >>> > >>> char buffer[100]; > >>> > >>> char *end = a.enc_dec<ENCODE>(buffer); > >>> > >>> cout << "Actual storage was " << end-buffer << "\n"; > >>> > >>> astruct b; > >>> > >>> (void) b.enc_dec<DECODE>(buffer); // decode it > >>> > >>> cout << "A.a = " << b.a << "\n"; > >>> for (auto e : b.b) { > >>> cout << " " << e; > >>> } > >>> > >>> cout << "\n"; > >>> > >>> cout << "a.lba = " << b.lba << "\n"; > >>> > >>> return 0; > >>> } > >>> > >>> > >>> Allen Samuels > >>> SanDisk |a Western Digital brand > >>> 2880 Junction Avenue, San Jose, CA 95134 > >>> T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > >>> > >>> > >>>> -----Original Message----- > >>>> From: Mark Nelson [mailto:mnelson@xxxxxxxxxx] > >>>> Sent: Tuesday, July 12, 2016 8:13 PM > >>>> To: Sage Weil <sweil@xxxxxxxxxx>; Allen Samuels > >>>> <Allen.Samuels@xxxxxxxxxxx> > >>>> Cc: ceph-devel <ceph-devel@xxxxxxxxxxxxxxx> > >>>> Subject: Re: bluestore onode diet and encoding overhead > >>>> > >>>> > >>>> > >>>> On 07/12/2016 08:50 PM, Sage Weil wrote: > >>>>> On Tue, 12 Jul 2016, Allen Samuels wrote: > >>>>>> Good analysis. > >>>>>> > >>>>>> My original comments about putting the oNode on a diet included > >>>>>> the idea of a "custom" encode/decode path for certain high-usage > cases. > >>>>>> At the time, Sage resisted going down that path hoping that a > >>>>>> more optimized generic case would get the job done. Your analysis > >>>>>> shows that while we've achieved significant space reduction this > >>>>>> has come at the expense of CPU time -- which dominates small > >>>>>> object performance (I suspect that eventually we'd discover that > >>>>>> the variable length decode path would be responsible for a > >>>>>> substantial read performance degradation also -- which may or may > >>>>>> not be part of the read performance drop-off that you're seeing). > >>>>>> This isn't a > >> surprising > >>>> result, though it is unfortunate. > >>>>>> > >>>>>> I believe we need to revisit the idea of custom encode/decode > >>>>>> paths for high-usage cases, only now the gains need to be focused > >>>>>> on CPU utilization as well as space efficiency. > >>>>> > >>>>> I still think we can get most or all of the way there in a generic > >>>>> way by revising the way that we interact with bufferlist for > >>>>> encode and > >> decode. > >>>>> We haven't actually tried to optimize this yet, and the current > >>>>> code is pretty horribly inefficient (asserts all over the place, > >>>>> and many layers of pointer indirection to do a simple append). I > >>>>> think we need to do two > >>>>> things: > >>>>> > >>>>> 1) decode path: optimize the iterator class so that it has a const > >>>>> char *current and const char *current_end that point into the > >>>>> current buffer::ptr. This way any decode will have a single > >>>>> pointer > >>>>> add+comparison to ensure there is enough data to copy before > >>>>> add+falling into > >>>>> the slow path (partial buffer, move to next buffer, etc.). > >>>>> > >>>> > >>>> I don't have a good sense yet for how much this is hurting us in > >>>> the read path. We screwed something up in the last couple of weeks > >>>> and small > >> reads > >>>> are quite slow. > >>>> > >>>>> 2) Having that comparison is still not ideal, but we shoudl > >>>>> consider ways to get around that too. For example, if we know > >>>>> that we are going to decode N M-byte things, we could do an > >>>>> iterator 'reserve' or 'check' that ensures we have a valid pointer > >>>>> for that much and then proceed without checks. The interface here > >>>>> would be tricky, though, since in the slow case we'll span buffers > >>>>> and need to magically fall back to a different decode path (hard > >>>>> to maintain) or do a temporary copy (probably faster but we need > >>>>> to ensure the iterator owns it and frees is later). I'd say this > >>>>> is step 2 and optional; step 1 will have the most > >>>> benefit. > >>>>> > >>>>> 3) encode path: currently all encode methods take a bufferlist& > >>>>> and the bufferlist itself as an append buffer. I think this is > >>>>> flawed and limiting. Instead, we should make a new class called > >>>>> buffer::list::appender (or similar) and templatize the encode > >>>>> methods so they can take a safe_appender (which does bounds > >>>>> checking) or an unsafe_appender (which does not). For the latter, > >>>>> the user takes responsibility for making sure there is enough > >>>>> space by doing a > >>>>> reserve() type call which returns an unsafe_appender, and it's > >>>>> their job to make sure they don't shove too much data into it. > >>>>> That should make the encode path a memcpy + ptr increment (for > >>>>> savvy/optimized > >>>> callers). > >>>> > >>>> Seems reasonable and similar in performance to what Piotr and I > >>>> were discussing this morning. As a very simple test I was thinking > >>>> of doing a > >> quick > >>>> size computation and then passing that in to increase the > >>>> append_buffer > >> size > >>>> when the bufferlist is created in Bluestore::_txc_write_nodes. His > >>>> idea > >> went > >>>> a bit farther to break the encapsulation, compute the fully encoded > >>>> message, and dump it directly into a buffer of a computed size > >>>> without > >> the > >>>> extra assert checks or bounds checking. Obviously his idea would > >>>> be > >> faster > >>>> but more work. > >>>> > >>>> It sounds like your solution would be similar but a bit more formalized. > >>>> > >>>>> > >>>>> I suggest we use bluestore as a test case to make the interfaces > >>>>> work and be fast. If we succeed we can take advantage of it > >>>>> across the reset of the code base as well. > >>>> > >>>> Do we have other places in the code with similar byte append > behavior? > >>>> That's what's really killing us I think, especially with how small > >>>> the new append_buffer is when you run out of space when appending > bytes. > >>>> > >>>>> > >>>>> That's my thinking, at least. I haven't had time to prototype it > >>>>> out yet, but I think our goal should be to make the encode/decode > >>>>> paths capable of being a memcpy + ptr addition in the fast path, > >>>>> and let that guide the interface... > >>>>> > >>>>> sage > >>>>> -- > >>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" > >>>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx More > >>>> majordomo > >>>>> info at http://vger.kernel.org/majordomo-info.html > >>>>> -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html