On Tue, 12 Jul 2016, Allen Samuels wrote:
> Good analysis.
>
> My original comments about putting the oNode on a diet included the
> idea of a "custom" encode/decode path for certain high-usage cases.
> At the time, Sage resisted going down that path hoping that a more
> optimized generic case would get the job done. Your analysis shows
> that while we've achieved significant space reduction this has come
> at the expense of CPU time -- which dominates small object
> performance (I suspect that eventually we'd discover that the
> variable length decode path would be responsible for a substantial
> read performance degradation also -- which may or may not be part of
> the read performance drop-off that you're seeing). This isn't a
> surprising result, though it is unfortunate.
>
> I believe we need to revisit the idea of custom encode/decode paths
> for high-usage cases, only now the gains need to be focused on CPU
> utilization as well as space efficiency.

I still think we can get most or all of the way there in a generic way
by revising how we interact with bufferlist for encode and decode. We
haven't actually tried to optimize this yet, and the current code is
pretty horribly inefficient (asserts all over the place, and many
layers of pointer indirection to do a simple append). I think we need
to do a few things:

1) Decode path: optimize the iterator class so that it has a const
char *current and a const char *current_end that point into the
current buffer::ptr. That way any decode needs only a single pointer
add and comparison to ensure there is enough data to copy before
falling into the slow path (partial buffer, move to the next buffer,
etc.).

2) Having that comparison is still not ideal, so we should consider
ways to get around it too. For example, if we know that we are going
to decode N M-byte things, we could do an iterator 'reserve' or
'check' that ensures we have a valid pointer for that much data and
then proceed without per-item checks.
The interface here would be tricky, though, since in the slow case
we'll span buffers and need to either magically fall back to a
different decode path (hard to maintain) or do a temporary copy
(probably faster, but we need to ensure the iterator owns it and frees
it later). I'd say this is step 2 and optional; step 1 will have the
most benefit.

3) Encode path: currently all encode methods take a bufferlist& and
use the bufferlist itself as the append buffer. I think this is flawed
and limiting. Instead, we should add a new class called
buffer::list::appender (or similar) and templatize the encode methods
so they can take either a safe_appender (which does bounds checking)
or an unsafe_appender (which does not). For the latter, the caller
takes responsibility for making sure there is enough space by making a
reserve()-type call that returns an unsafe_appender; it is then their
job to make sure they don't shove too much data into it. That should
make the encode path a memcpy + pointer increment (for
savvy/optimized callers).

I suggest we use bluestore as a test case to make the interfaces work
and be fast. If we succeed, we can take advantage of it across the
rest of the code base as well.

That's my thinking, at least. I haven't had time to prototype it yet,
but I think our goal should be to make the encode/decode paths capable
of being a memcpy + pointer addition in the fast path, and let that
guide the interface...

sage