Re: crimson-osd vs legacy-osd: should the perf difference be already noticeable?

Hi Avi!

I apologize for the late response and want to thank you for the input.

On Wed, Jan 15, 2020 at 5:24 PM Avi Kivity <avi@xxxxxxxxxxxx> wrote:
> Ok, so it's not just about alignment, but also about sizes. We can also
> allow the application to specify how many bytes it wants to read (in
> fact, it can already do that with read_exactly, but input_stream does
> not pass the information along).

Hmm, I believe we need to differentiate between the size of a single
linearization request (the size_t passed to e.g. read_exactly() in
order to get e.g. a flat header) and the more general size of the
buffer-to-produce-by-stack. The latter would be acquired on the same
layer as application-provided buffers. However, to avoid imposing
extensive memcpy, the "general size" would need to be basically a hint
that the native stack is free to ignore (just like input_buffer_factory
was).
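
To make the distinction concrete, something like this (just a sketch;
the struct and field names are invented, it's not an existing Seastar
interface):

    #include <cstddef>

    // Invented names -- a sketch, not an existing Seastar interface.
    struct stream_read_hints {
        // Hard requirement: bytes that must come back linearized (the
        // size_t passed to e.g. read_exactly() for a flat header).
        std::size_t linearize_size;
        // Soft hint: preferred size of the buffer produced by the
        // stack. POSIX could honour it to batch syscalls; native could
        // ignore it to avoid memcpy (like input_buffer_factory could).
        std::size_t produce_size_hint;
    };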

> Let's list the possible cases:
>
> - the application knows nothing (common when parsing a complex stream
> containing small objects). This is where Scylla is, similar to an HTTP
> server.
>
> - the protocol has rigid structure (fixed size header + variable
> payload). The application wants the header in a linearized buffer and
> the payload in a free-form iovec. This corresponds to cyanstore.

Ceph's on-wire protocol divides the payload into a sequence of
segments. Only one segment may carry an alignment hint. There is also
a fixed-size epilogue as the message's tail. Still, those details
don't seem to matter much here.
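
For illustration, a per-segment descriptor boils down to something
like this (a simplified sketch, not the authoritative on-wire
definition):

    #include <cstdint>

    // Simplified sketch of a segment descriptor.
    struct segment_desc {
        std::uint32_t length;     // segment length in bytes
        std::uint16_t alignment;  // meaningful for at most one segment
    };
    // frame = fixed-size header | segment 0 | ... | segment N-1
    //         | fixed-size epilogue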

> - the protocol has rigid structure as above. The application wants the
> header in a linearized buffer and the payload in its own buffers due to
> alignment or ownership requirements. This corresponds to a production
> storage server that has alignment requirements for talking to storage
> and ownership/placement requirements for caching blocks.
>
> [..]
>
> Is this a good set of capabilities to provide?

For the POSIX stack case there is an extra performance requirement:
1 syscall per (not-too-big) message *on average*. It's achievable
because reading message N's payload can be combined with reading
message N+1's header. The issue with "current linearization size" vs
"buffer-to-produce-by-stack size" comes from here. Without the ability
to ignore the latter in the native stack, crimson would basically
always need to linearize the entire message to get the best
performance from POSIX. However, this would hurt the native stack with
excessive memcpy. :-(
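
A minimal POSIX sketch of that batching, assuming a fixed-size header
and with error handling reduced to the bare minimum:

    #include <unistd.h>

    // Read payload(N) together with header(N+1); payload_len is known
    // from the already-parsed header(N). Ideally a single syscall.
    ssize_t read_payload_plus_next_header(int fd, char* buf,
                                          size_t payload_len,
                                          size_t header_len) {
        size_t want = payload_len + header_len;
        size_t got = 0;
        while (got < want) {
            ssize_t n = read(fd, buf + got, want - got);
            if (n <= 0) {
                return n;  // EOF or error; the caller deals with it
            }
            got += n;
        }
        return got;  // the trailing header_len bytes are header(N+1)
    }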

When the placement requirement comes into play, 1 read() per message
would translate into aligning-to-the-middle in the application-provided
buffer (already implemented in crimson's IBF for the coming SeaStore).
The alternative would be readv().
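
The aligning-to-the-middle trick, sketched with made-up constants
(this is not the actual IBF code, and a blocking read() stands in for
the real non-blocking I/O):

    #include <seastar/core/temporary_buffer.hh>
    #include <unistd.h>

    // Made-up stand-ins for the real alignment hint / header length.
    constexpr size_t alignment = 4096;
    constexpr size_t header_size = 32;

    ssize_t read_middle_aligned(int fd, size_t payload_len,
                                seastar::temporary_buffer<char>& buf) {
        buf = seastar::temporary_buffer<char>::aligned(
                alignment, alignment + payload_len);
        // Start the read so that the payload, which follows the
        // header within the same syscall, lands exactly on the
        // aligned boundary at buf.get_write() + alignment.
        char* read_start = buf.get_write() + (alignment - header_size);
        return read(fd, read_start, header_size + payload_len);
    }
    // The readv() alternative: two iovecs, the header into a small
    // scratch buffer and the payload straight into an aligned one.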

An application might also want to prefetch to amortize syscall costs
for tiny messages. Still, if it can get control over buffer allocation
and the size hint, this doesn't look impossible.
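
E.g. carving message views out of one prefetched buffer (a sketch;
share() is cheap as it only takes another reference on the underlying
storage):

    #include <seastar/core/temporary_buffer.hh>
    #include <cstddef>
    #include <utility>

    // Carve one tiny message out of a prefetched buffer without
    // copying; header/payload keep the big buffer alive via share().
    std::pair<seastar::temporary_buffer<char>,
              seastar::temporary_buffer<char>>
    next_message(seastar::temporary_buffer<char>& big, size_t& off,
                 size_t header_size, size_t payload_len) {
        auto header = big.share(off, header_size);
        auto payload = big.share(off + header_size, payload_len);
        off += header_size + payload_len;
        return {std::move(header), std::move(payload)};
    }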


> If it is, then we can implement "linearizes" differently for each stack,
> and also depending on whether the buffer is provided by the user or the
> stack.

Yes, this would give each stack the possibility to skip unnecessary
memcpy.

> For buffers provided by the stack (which there is only a linearization
> requirement, not a placement requirement):
>
>   - posix allocates a buffer and issues read() syscalls until the buffer
> is full
>
>   - native will attempt to temporary_buffer::share() the buffer if it
> fits into a packet, and allocate and copy if it does not

That's the painful place. For the sake of the native stack's
efficiency, linearization should be limited to small fragments only
(see the sketch below). I'm afraid this turns the linearization sizes
into:
  * optional things, basically hints (like input_buffer_factory was), or
  * obligatory parameters controllable on an isNative() / isPosix()
    basis.
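
To make the share-vs-copy trade-off concrete, the native side could
look roughly like this (a sketch assuming a simple fragment queue, not
the actual stack code):

    #include <seastar/core/temporary_buffer.hh>
    #include <algorithm>
    #include <cstddef>
    #include <cstring>
    #include <deque>

    // Linearize n bytes from a fragment queue: zero-copy via share()
    // when the front fragment suffices, allocate-and-copy otherwise.
    // Assumes the queue already holds at least n bytes.
    seastar::temporary_buffer<char>
    linearize(std::deque<seastar::temporary_buffer<char>>& frags,
              size_t n) {
        if (frags.front().size() >= n) {
            auto out = frags.front().share(0, n);  // zero-copy view
            frags.front().trim_front(n);
            return out;
        }
        seastar::temporary_buffer<char> out(n);    // the memcpy path
        size_t done = 0;
        while (done < n) {
            auto& f = frags.front();
            size_t take = std::min(f.size(), n - done);
            std::memcpy(out.get_write() + done, f.get(), take);
            f.trim_front(take);
            done += take;
            if (f.size() == 0) {
                frags.pop_front();
            }
        }
        return out;
    }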

Regards,
Radek
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx


