Just for labelling, I'll invent the names fast_encode and fast_decode as labels for the new scheme. Fundamentally, the problem that we have is lack of knowledge about how much data is going to be encoded (and to a lesser extent, decoded). Currently, we basically check for space on each micro-encode operation. Clearly those checks have to be eliminated. That leaves us with basically three choices: (1) Pre-allocate a buffer sufficiently large. (2) Check for sufficient space at "strategic" places in the code. (3) Fiddle with the CPU memory mapping tables to generate a page fault when we run off the end of our buffer (which clearly must be page aligned, blah blah blah) and then contrive to auto-allocate and restart. I reject (3) as too complex and not worth the effort. I reject (2) as difficult to maintain and difficult to determine where the "strategic" points in the code are that retain correctness, but happen infrequently enough to minimize CPU utilization. That leaves us with (1). Method (1) must consist of making a worst-case prediction of the size of the encoded data. Encoded the data into the allocated buffer and then "freeing" the unused portion (since tight encoding of the data is content-dependent in length). The encode and free processes are relatively straightforward, but the prediction process has some options to explore. (1a) Constant "maximum". We could easily establish something like a 128K or 256K constant upper limit and just use a buffer of that size. (1b) computed "maximum", For simple objects (a few fields/containers) it's relatively easy to add up the necessary sizes to generate an estimate. For complex objects, this is error prone because you're duplicating the encode with the estimate-size logic. Right now, I'm trying to concoct some combinations of C++ templates that's lets me merge the estimate-size, encode and decode functions into a common routine so that we can avoid this system-matic error. Stay tuned. One of the dangers of the prediction scheme is what if the prediction is incorrect -- too small. Then you'll get buffer overrun. It's been suggested that we insert some special debug-assert code to detect that situation which is only enabled at compile-time. I believe this is NOT the right solution. The buffer overrun problem is data dependent. That means it will be especially hard to debug in the field as what you're looking for is essentially silent data corruption. I believe that the best solution is to simply check the fully-encoded buffer against the estimate that was made. If the estimate is too small, then assert-out. Leave this in production code. If we encounter an encode-buffer overrun at least we'll now that was the source of the problem and we can fix it. (I assert that if you see this, it'll be pretty obvious where it went wrong -- especially if I'm able to create a unified estimate-size, encode and decode function. One last problem to solve is the decode problem. We need to know how much data is in a fast-encode buffer in order to ensure that it's not fragmented in the buffer::list. This is relatively easy if the encode leaves a "total bytes" field at the start of the operation. Now we can see the pseudo-code for fast-encode. Void fast_encode(object& o, bufferptr& b) { size_t estimate = o.estimate_sizeof_fast_encode() + sizeof(int); // Extra int is for explicit size of the overall buffer. char * buf_start = b.push_back (estimate); // returns pointer to a block of memory appended to the end that's of the specified size; char * buf_end = o.do_fast_encode(buf_start + sizeof(int)); // starts serialization into the address pointed at by the input parameter, returns "next" pointer, i.e., pointer to next unused byte size_t consumed = buf_end - buf_start; // Compute consumed bytes assert(consumed <= estimate); // Here's where we catch a silent data corruption due to an overflow *(int *)buf_start = consumed; // Total size of consumed buffer, including the starting int stored back at the start... size_t unused = estimate - consumed; // Amount of unused space at the end. b.pop_back(unused); // "free" the unused bytes at the end of the buffer. } Now we can see the shape of the fast_decode function. Void fast_decode(object& o, bufferptr& b) { Int encode_size = b.pop_front_int(); // Remove first bytes where the encode left the size of the full buffer const char *buf_start = b.contiguous_ptr(encode_size - sizeof(int)); // return pointer to sequential buffer of specified number of bytes <<<<- here's where we might have to copy discontiguous buffers into a single buffer. const char *buf_end = o.fast_decode(buf_start); // Decode data from the buffer into the object, return the pointer to the last consumed byte. Size_t consumed_bytes = buf_end - buf_start; Assert(consumed_bytes == encode_size); // Consistency check, we consumed exactly as much as we encoded (NB, I may have left off a " + sizeof(int)" for the initial buffer size.) b.pop_front(consumed_bytes); // logically remove the bytes that we've consumed } Allen Samuels SanDisk |a Western Digital brand 2880 Junction Avenue, Milpitas, CA 95134 T: +1 408 801 7030| M: +1 408 780 6416 allen.samuels@xxxxxxxxxxx > -----Original Message----- > From: Sage Weil [mailto:sweil@xxxxxxxxxx] > Sent: Wednesday, July 13, 2016 9:06 AM > To: Piotr Dałek <branch@xxxxxxxxxxxxxxxx> > Cc: Mark Nelson <mnelson@xxxxxxxxxx>; Allen Samuels > <Allen.Samuels@xxxxxxxxxxx>; ceph-devel <ceph-devel@xxxxxxxxxxxxxxx> > Subject: Re: bluestore onode diet and encoding overhead > > On Wed, 13 Jul 2016, Piotr Dałek wrote: > > On Tue, Jul 12, 2016 at 10:13:14PM -0500, Mark Nelson wrote: > > > > > > > > > On 07/12/2016 08:50 PM, Sage Weil wrote: > > > >On Tue, 12 Jul 2016, Allen Samuels wrote: > > > >>[..] > > > >>I believe we need to revisit the idea of custom encode/decode > > > >>paths for high-usage cases, only now the gains need to be focused > > > >>on CPU utilization as well as space efficiency. > > > > > > > >I still think we can get most or all of the way there in a generic > > > >way by revising the way that we interact with bufferlist for encode and > decode. > > > >We haven't actually tried to optimize this yet, and the current > > > >code is pretty horribly inefficient (asserts all over the place, > > > >and many layers of pointer indirection to do a simple append). I > > > >think we need to do two > > > >things: > > > > > > > >1) decode path: optimize the iterator class so that it has a const > > > >char *current and const char *current_end that point into the > > > >current buffer::ptr. This way any decode will have a single > > > >pointer > > > >add+comparison to ensure there is enough data to copy before > > > >add+falling into > > > >the slow path (partial buffer, move to next buffer, etc.). > > > > > > I don't have a good sense yet for how much this is hurting us in the > > > read path. We screwed something up in the last couple of weeks and > > > small reads are quite slow. > > > > The main issue with decode using bufferlist is that we cannot assume > > anything regarding internal data memory layout. We can't do anything > > like > > > > int j = *((int*) bufferptr.c_str()); > > > > because bufferptr may be too short for "j" to be read in one go. > > Possible solution would be to ensure that bufferptrs contain > > contiguous blocks large enough to store one unit of data (be it, for > > example, onode info). > > > > > >2) Having that comparison is still not ideal, but we shoudl > > > >consider ways to get around that too. For example, if we know that > > > >we are going to decode N M-byte things, we could do an iterator > > > >'reserve' or 'check' that ensures we have a valid pointer for that > > > >much and then proceed without checks. The interface here would be > > > >tricky, though, since in the slow case we'll span buffers and need > > > >to magically fall back to a different decode path (hard to > > > >maintain) or do a temporary copy (probably faster but we need to > > > >ensure the iterator owns it and frees is later). I'd say this is step 2 and > optional; step 1 will have the most benefit. > > > > Exactly my point. > > Regarding a copy, we could just do something like rebuild_contiguous() > > and make sure bufferlist is a one, large bufferptr or it is split on > > logical data unit boundary instead of random places that are > > messenger/underyling I/O store dependent. That will take care of both > > memory ownership and bufferlist continuity. > > In practice, it is pretty much always contiguous: msgr allocates a whole chunk, > and when we read stuff off disk or out of kv store it is one chunk. > Pretty much the only time it isn't is when you just encoded it... and we > generaly don't decode in that case. > > Anyway, the point is that we can do something pretty simple and non- > optimal in the non-contiguous case (like rebuild()) and it shouldn't really > matter. > > > > >3) encode path: currently all encode methods take a bufferlist& and > > > >the bufferlist itself as an append buffer. I think this is flawed > > > >and limiting. Instead, we should make a new class called > > > >buffer::list::appender (or similar) and templatize the encode > > > >methods so they can take a safe_appender (which does bounds > > > >checking) or an unsafe_appender (which does not). For the latter, > > > >the user takes responsibility for making sure there is enough space > > > >by doing a reserve() type call which returns an unsafe_appender, > > > >and it's their job to make sure they don't shove too much data into > > > >it. That should make the encode path a memcpy + ptr increment (for > savvy/optimized callers). > > > > > > Seems reasonable and similar in performance to what Piotr and I were > > > discussing this morning. As a very simple test I was thinking of > > > doing a quick size computation and then passing that in to increase > > > the append_buffer size when the bufferlist is created in > > > Bluestore::_txc_write_nodes. His idea went a bit farther to break > > > the encapsulation, compute the fully encoded message, and dump it > > > directly into a buffer of a computed size without the extra assert > > > checks or bounds checking. Obviously his idea would be faster but > > > more work. > > > > > > It sounds like your solution would be similar but a bit more formalized. > > > > I like the idea, because that way we could add extra checks to debug > > builds (added via preprocessor define) and have the ability to find > > bugs easier, retaining performance on release/optimized builds. > > Or we could go that way all-in, have single kind of appender and do > > bounds check only on debug builds. > > Yeah--I'd go for debug asserts that compile out of release builds. (We should > either create a new assert macro, or go do teh work to change current > assert()'s to ceph_assert_always() or whatever.) > > > > >I suggest we use bluestore as a test case to make the interfaces > > > >work and be fast. If we succeed we can take advantage of it across > > > >the reset of the code base as well. > > > > > > Do we have other places in the code with similar byte append > > > behavior? That's what's really killing us I think, especially with > > > how small the new append_buffer is when you run out of space when > > > appending bytes. > > > > I still recommend doing a dry bench and measure how fast > > _txc_write_nodes is. We may be spending a lot of time optimizing > > bufferlist API, when the real issue lies in another place. Creation > > and destruction of bufferlist is one of my point of concerns in > > _txc_write_nodes, but then, I have no clue on how many times that > happen per entire call. > > If the following > > > > // (KeyValueDB::Transaction t, set<OnodeRef>::iterator p) > > t->set(PREFIX_OBJ, (*p)->key, bl); > > > > performs synchronously (i.e. doesn't do anything else with bl contents > > after leaving t->set call), we could just rewind bl and start > > overwriting it. In worst case, we would alloc extra memory, which is > > still one case out of 3 possible (remaining being new onode/bnode of > > the same or smaller sizes than previous one). > > Almost certain it does make a copy of the buffer. Maybe a bufferlist method > that clears ptr list but keeps one of them as the append buffer if it has no > more references. clear_keep_append() or something. > > sage ��.n��������+%������w��{.n����z��u���ܨ}���Ơz�j:+v�����w����ޙ��&�)ߡ�a����z�ޗ���ݢj��w�f