Hi Avi!

I apologize for the late response and want to thank you for the input.

On Wed, Jan 15, 2020 at 5:24 PM Avi Kivity <avi@xxxxxxxxxxxx> wrote:
> Ok, so it's not just about alignment, but also about sizes. We can also
> allow the application to specify how many bytes it wants to read (in
> fact, it can already do that with read_exactly, but input_stream does
> not pass the information along).

Hmm, I believe we need to differentiate between the size of a single
linearization request (the size_t instance passed to e.g. read_exactly
in order to get e.g. a flat header) and the more general size for the
buffer-to-produce-by-stack. The latter would be acquired on the same
layer as application-provided buffers. However, to avoid imposing
excessive memcpy, the "general size" would need to be basically a hint
that the native stack may ignore (just like the input_buffer_factory
was).

> Let's list the possible cases:
>
> - the application knows nothing (common when parsing a complex stream
> containing small objects). This is where Scylla is, similar to an HTTP
> server.
>
> - the protocol has rigid structure (fixed size header + variable
> payload). The application wants the header in a linearized buffer and
> the payload in a free-form iovec. This corresponds to cyanstore.

Ceph's on-wire protocol divides the payload into a sequence of segments.
Only one segment may have an alignment hint. There is also a fixed-size
epilogue as the message's tail. Still, those details don't seem to
matter much here.

> - the protocol has rigid structure as above. The application wants the
> header in a linearized buffer and the payload in its own buffers due to
> alignment or ownership requirements. This corresponds to a production
> storage server that has alignment requirements for talking to storage
> and ownership/placement requirements for caching blocks.
>
> [..]
>
> Is this a good set of capabilities to provide?

For the POSIX stack case there is an extra performance requirement:
1 syscall per (not-too-big) message *on average*. It's obtainable
because reading message N's payload can be combined with reading
message N + 1's header.

The issue with "current linearization size" vs "buffer-to-produce-by-stack
size" comes from here. Without the ability to ignore the latter one in
native, crimson would basically always need to linearize the entire
message to get the best performance from POSIX. However, this would
hurt the native stack with excessive memcpy. :-(

When the placement requirement comes into play, 1 read() / msg would
translate into aligning-to-the-middle in the application-provided
buffer (already implemented in crimson's IBF for the coming "sea
store"). An alternative would be readv().

An application might also want to prefetch to amortize syscall costs
for tiny messages. Still, if it could get control over buffer
allocation / the size hint, this doesn't look impossible.

> If it is, then we can implement "linearizes" differently for each stack,
> and also depending on whether the buffer is provided by the user or the
> stack.

Yes, this will give the stacks the possibility to offload memcpy.

> For buffers provided by the stack (which there is only a linearization
> requirement, not a placement requirement):
>
> - posix allocates a buffer and issues read() syscalls until the buffer
> is full
>
> - native will attempt to temporary_buffer::share() the buffer if it
> fits into a packet, and allocate and copy if it does not

That's the painful place. For the sake of the native stack's efficiency,
linearization should be limited to small fragments only. I'm afraid
this turns the linearization sizes into:

* optional things, basically hints (like the input_buffer_factory
  was) or
* obligatory parameters controllable on the isNative() / isPosix()
  basis.

Regards,
Radek
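
A minimal sketch of the "1 syscall per message on average" point made
above, under stated assumptions. Everything here is hypothetical and for
illustration only: fake_socket, make_stream, drain_separate and
drain_combined are made-up names, and the wire format (a 4-byte
native-endian length header followed by the payload) is not Ceph's
protocol. The fake read() returns short only at end of stream, which a
real socket does not guarantee. The sketch counts read() calls for two
strategies: reading header and payload separately, and reading payload N
together with header N + 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical wire format: fixed 4-byte length header, then the payload.
constexpr std::size_t header_size = sizeof(std::uint32_t);

// Stand-in for a socket that counts how often read() is called.
struct fake_socket {
    std::vector<char> data;
    std::size_t pos = 0;
    std::size_t reads = 0;

    // Copies up to `len` bytes into `dst`; returns short only at end of stream.
    std::size_t read(char* dst, std::size_t len) {
        ++reads;
        const std::size_t n = std::min(len, data.size() - pos);
        if (n > 0) {
            std::memcpy(dst, data.data() + pos, n);
        }
        pos += n;
        return n;
    }
};

// Builds a byte stream holding one message per entry in `payload_sizes`.
std::vector<char> make_stream(const std::vector<std::uint32_t>& payload_sizes) {
    std::vector<char> out;
    for (std::uint32_t size : payload_sizes) {
        char hdr[header_size];
        std::memcpy(hdr, &size, header_size);
        out.insert(out.end(), hdr, hdr + header_size);
        out.insert(out.end(), static_cast<std::size_t>(size), 'x');
    }
    return out;
}

// Strategy 1: one read for the header, one for the payload.
// Returns the number of complete messages consumed.
std::size_t drain_separate(fake_socket& s) {
    std::size_t messages = 0;
    std::vector<char> buf;
    for (;;) {
        char hdr[header_size];
        if (s.read(hdr, header_size) != header_size) break;  // end of stream
        std::uint32_t len;
        std::memcpy(&len, hdr, header_size);
        buf.resize(len);
        if (s.read(buf.data(), len) != len) break;  // truncated payload
        ++messages;
    }
    return messages;
}

// Strategy 2: after the very first header, every read asks for
// "payload of message N" plus "header of message N + 1".
std::size_t drain_combined(fake_socket& s) {
    std::size_t messages = 0;
    char hdr[header_size];
    if (s.read(hdr, header_size) != header_size) return 0;
    std::vector<char> buf;
    for (;;) {
        std::uint32_t len;
        std::memcpy(&len, hdr, header_size);
        buf.resize(len + header_size);
        const std::size_t got = s.read(buf.data(), buf.size());
        if (got < len) break;         // truncated payload
        ++messages;
        if (got < buf.size()) break;  // no complete next header: stream ended
        std::memcpy(hdr, buf.data() + len, header_size);
    }
    return messages;
}
```

In this simulation, a stream of M messages costs 2M + 1 reads with the
separate strategy (the last one only detects end of stream) and M + 1
reads with the combined one. The combined strategy needs the trailing
header to land somewhere, which is where the buffer size hint and the
aligning-to-the-middle placement discussed above come in.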