Re: crimson-osd vs legacy-osd: should the perf difference be already noticeable?


On 14/01/2020 22.02, Radoslaw Zarzynski wrote:
Hi Avi,

I responded inline to both of your messages.

On 13/01/2020 21.56, Radoslaw Zarzynski wrote:
Well, that's what I would like to avoid. I'd like a developer to know
that if they are developing with the posix stack, the application would
work and work well with the native stack, not that they have to retest
everything.
I see your point. It's a valid one.

That memcpy would be incurred only if the state machine specified it
needed its own buffer. If it specified it can run from a stack-provided
buffer, it would still be zero copy.
If we could make this decision at run-time, then fine. I'm afraid
the application currently has no way to determine which stack is
to be used and to adjust the choice dynamically. If so, supporting
both POSIX and native efficiently (with no trade-offs) would boil
down to a compile-time decision, and thus separate builds.
Still, twice the testing. :-(


I don't understand why you say this. If Seastar performs the memcpy transparently when needed, why do you need separate builds?



What your proposal does is reuse the read(2) call's copy to userspace
for the application's purposes - the copy is still there. So the native
stack isn't handicapped by this copy, it happens in both.
This assumes that crimson will always need to memcpy() the data
retrieved from a network stack. I believe that's not the case.

For the kernel drivers (POSIX stack + kernel's storage),
the read() in Seastar is actually reused to give the kernel
an opportunity to remap pages, instead of doing memcpy(),
when the retrieved payload is written to storage. This can
happen because the buffer Seastar read into had been properly
aligned.

Apart from the read() itself there is no inherent memcpy() on


This is the memcpy I was referring to. And the solution I'd like to see is one where the application tells seastar which buffers it wants the data placed in, which allows seastar either to direct read() into those buffers (reusing the copy it performs anyway) or to perform the copy itself (in the native stack case).


the data path. When flowing through it, the payload is always
conveyed as a ref-counted scatter-gather list. This holds even
if the SGL has only a single segment, as when using POSIX
with the particular implementation of input_buffer_factory.

For the user-space drivers (native stack + SPDK) there is a chance
to squeeze out memcpy() / remappings entirely. Of course, this
assumes that the storage HW is able to deal with inflated SGLs
efficiently. I hope vendors could shed more light on that.


If the storage indicates it doesn't need alignment, then the application avoids telling seastar to read into the application's buffers and instead accepts the current stack-provided temporary_buffers.
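
The "reuse the copy read() does anyway" idea can be sketched with plain POSIX calls (a minimal illustration, not Seastar code; the function name is invented): the application supplies the exact destination buffers, and readv() scatters the kernel's one unavoidable copy straight into them, so no second memcpy() happens in user space.

```cpp
// Sketch of the "reuse the kernel's copy" path (plain POSIX, not Seastar):
// the application supplies the exact buffers it wants the payload in,
// and readv() scatters the kernel's one unavoidable copy straight into
// them; no second memcpy() is needed in user space.
#include <cassert>
#include <cstring>
#include <sys/uio.h>
#include <unistd.h>

bool scatter_read_into_user_buffers() {
    int fds[2];
    if (pipe(fds) != 0)
        return false;

    // Pretend this is the payload arriving from the network peer.
    const char payload[] = "0123456789abcdef";
    if (write(fds[1], payload, sizeof payload) != (ssize_t)sizeof payload)
        return false;

    // Buffers chosen by the application (e.g. aligned, cache-friendly).
    char header[8];
    char body[sizeof payload - sizeof header];
    iovec iov[2] = {{header, sizeof header}, {body, sizeof body}};

    // The POSIX backend points the system call at the caller's buffers,
    // so the copy read()/readv() performs anyway is the only one made.
    bool ok = readv(fds[0], iov, 2) == (ssize_t)sizeof payload
           && memcmp(header, "01234567", sizeof header) == 0
           && memcmp(body, "89abcdef", sizeof body) == 0;

    close(fds[0]);
    close(fds[1]);
    return ok;
}
```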



From our last discussion I recall your point about the impact
that ref-counting can have on cache density (in the sense of
e.g. BlueStore's cache). For sure, segments of our SGL will be
bigger than the actual payload and will contain metadata or
other junk.
My answer would be that the application's caching policy shouldn't
belong to the network layer. It has too little information to judge
whether a given buffer needs to be cached or not. There are
many cases where you won't cache. To exemplify while staying
in BlueStore's domain: `bluestore_default_buffered_write`
is `false` by default. That is, BlueStore **doesn't cache on writes**.


If you can decide in advance you need to cache, provide the cache-friendly buffers to seastar and it will make sure the data lands there (either through the kernel's read() or through its own memcpy).


If you can't make that decision in advance, and you also don't need alignment for other reasons, then accept seastar's non-aligned buffers and perform the adjustment yourself if it is later needed.


Is there a case I missed?



On Tue, Jan 14, 2020 at 12:16 PM Avi Kivity <avi@xxxxxxxxxxxx> wrote:
The default placing_data_source would just copy data from the original
data_source to the buffers provided by the user.

The placing_data_source provided by
posix_data_source_impl::to_placing_data_source() would cooperate with
the posix stack to read directly into the buffers provided by the user
(reusing the copy performed by the system call).

If a data movement engine is available, the native stack might program
it to perform the copy.
Well, this is about off-loading the memcpy() we would introduce
to the native stack. I still think the best option – from the
performance point of view – is to avoid this shuffling altogether.
The sacrifice for that would be the maintainability concern coming
from differentiated paths in the network layer.


If you know you don't need the memcpy, don't provide your pre-allocated buffers and it won't happen.
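
For illustration, here is a minimal, self-contained sketch of the semantics proposed in this thread. Every name in it (fake_source, placing_source, place_into) is invented for this example; it is not the actual Seastar API. The default placing wrapper performs the one explicit copy into an application-registered buffer, and callers that never register a buffer stay on the zero-copy path and pay nothing.

```cpp
// Hypothetical sketch of the placing_data_source semantics discussed
// above. Every name here is invented for illustration; this is not
// the actual Seastar API.
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for the underlying data_source: it owns the buffer the
// stack read into and can hand it over without copying.
struct fake_source {
    std::string pending = "payload-from-network";
    std::string get() { return std::move(pending); }   // zero-copy hand-off
};

// Default placing wrapper: if the application registered a destination
// buffer, perform the one explicit copy into it (what the native stack
// would need); otherwise fall through to the zero-copy path.
struct placing_source {
    fake_source& src;
    char* dest = nullptr;      // application-provided landing buffer
    std::size_t dest_len = 0;

    void place_into(char* buf, std::size_t len) { dest = buf; dest_len = len; }

    std::string get() {
        std::string data = src.get();
        if (dest != nullptr) {
            std::memcpy(dest, data.data(), std::min(dest_len, data.size()));
            return {};         // payload now lives in the app's buffer
        }
        return data;           // no placement requested: stays zero-copy
    }
};
```

Whether a POSIX backend can route the kernel's read() copy directly into the registered buffer then becomes an internal optimization, invisible to the application.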

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx



