On 17-07-06 04:40 PM, Jason Dillaman wrote:
> On Thu, Jul 6, 2017 at 10:22 AM, Piotr Dałek <piotr.dalek@xxxxxxxxxxxx> wrote:
>> So I really see two problems here: lack of API docs and
>> backwards-incompatible change in API behavior.
>
> Docs are always in need of update, so any pull requests would be
> greatly appreciated.
>
> However, I disagree that the behavior has substantively changed -- it
> was always possible for pre-Luminous to (sometimes) copy the buffer
> before the "rbd_aio_write" method completed.

But that copy was buried somewhere deep in the librbd internals and,
looking at the Jewel version, most would assume that the buffer is not really
copied and that the user is responsible for keeping it intact until the write
is complete. An API user doesn't really care about what's going on
internally, and it is beyond their control anyway.

> With Luminous, this
> behavior is more consistent -- but in a future release memory may be
> zero-copied. If your application can properly conform to the
> (unwritten) contract that the buffers should remain unchanged, there
> would be no need for the application to pre-copy the buffers.

So far I am forced to do a copy anyway (see below). The question is whether
it's me doing it or librbd. It doesn't make sense to have both do the
same, especially when handling tens of terabytes of data. For 10TB of data
that could mean at least 83 886 080 memory allocations, releases and copies
plus 2 684 354 560 page faults (assuming 4KB pages), and these are
best-case numbers assuming a 128KB I/O size. As I understand it, what you
expect from me is to at least double the number of memory copies and push
not "just" 20TB over the memory bus (reading 10TB from one buffer and
writing those 10TB to another), but 40TB.
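
The figures above can be re-derived with a few lines of arithmetic. This is only a sketch in Python, assuming binary units (1TB = 1024^4 bytes), one allocation/copy/release per 128KB write and one page fault per 4KB page; the variable names are illustrative and not taken from any Ceph code.

```python
TIB = 1024 ** 4
KIB = 1024

total_bytes = 10 * TIB   # 10TB of data, counted in binary units (assumption)
io_size = 128 * KIB      # assumed size of each write
page_size = 4 * KIB      # assumed page size

writes = total_bytes // io_size   # one allocation, copy and release per write
pages = total_bytes // page_size  # one first-touch page fault per page

# Each copy pass reads every byte once and writes it once.
traffic_one_copy_tib = 2 * total_bytes // TIB
traffic_two_copies_tib = 4 * total_bytes // TIB

print(writes, pages, traffic_one_copy_tib, traffic_two_copies_tib)
# prints: 83886080 2684354560 20 40
```

A second copy of the same data doubles every one of these numbers except the byte count itself, which is where the 40TB of memory traffic comes from.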
In other words, if I wrote my code based on how Jewel librbd works,
there would be no real issue, apart from the fact that my program would
suddenly consume more memory and burn more CPU cycles once librbd is
upgraded to Luminous. Considering the amount of data, that would be a
noticeable change.

> If the libfuse implementation requires that the memory is not-in-use
> by the time you return control to it (i.e. it's a synchronous API and
> you are using async methods), you will always need to copy it.

Yes, libfuse expects that once I leave the entry point, it is free to do
anything it wishes with the previously provided buffers -- and that's what it
actually does.
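
The lifetime problem can be modelled in a few lines. The sketch below is Python rather than the C API, and `DeferredWriter` and `write_callback` are hypothetical stand-ins (not librbd or libfuse calls): the write completes only later, the caller reuses its buffer as soon as the callback returns, so the queue has to hold its own copy.

```python
class DeferredWriter:
    """Hypothetical stand-in for an asynchronous write queue (not librbd)."""

    def __init__(self):
        self._pending = []
        self.device = bytearray()

    def submit(self, buf):
        # Take a private copy so the caller may reuse buf after we return.
        self._pending.append(bytes(buf))

    def flush(self):
        # "Complete" the queued writes some time after submit() returned.
        for chunk in self._pending:
            self.device += chunk
        self._pending.clear()


def write_callback(writer, caller_buf):
    # Models a synchronous entry point that hands data to an async writer.
    writer.submit(caller_buf)


writer = DeferredWriter()
caller_buf = bytearray(b"hello")
write_callback(writer, caller_buf)
caller_buf[:] = b"XXXXX"  # the caller reuses its buffer after the callback
writer.flush()
print(bytes(writer.device))  # b'hello'
```

Because the copy is taken in `submit`, a second copy further down the stack would only add cost without adding safety.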
> The C++
> API allows you to control the copying since you need to pass
> "bufferlist"s to the API methods and since they utilize a reference
> counter, there is no internal copying within librbd / librados.
How about a hybrid solution? Keep the old rbd_aio_write contract (don't copy
the buffer, on the assumption that it won't change) and, instead of
constructing a bufferlist containing a bufferptr to copied data, construct a
bufferlist containing a bufferptr made with create_static(user_buffer)?
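
To illustrate the trade-off between the two wrapping strategies, here is a small Python model. It does not use the Ceph buffer classes: `wrap_copy` and `wrap_static` are hypothetical names, with `bytes()` standing in for a copying bufferptr and `memoryview` standing in for a static, non-owning one.

```python
def wrap_copy(user_buffer):
    # New allocation plus a full copy of the caller's data.
    return bytes(user_buffer)


def wrap_static(user_buffer):
    # No allocation of payload memory and no copy; refers to the caller's bytes.
    return memoryview(user_buffer)


user_buffer = bytearray(b"payload")
copied = wrap_copy(user_buffer)
static = wrap_static(user_buffer)

# The caller breaks the "buffer must not change" contract before completion.
user_buffer[0] = ord("P")

print(bytes(copied))  # b'payload'
print(bytes(static))  # b'Payload'
```

The static wrapper avoids the allocation and the copy, but it is only correct as long as the caller honours the contract and leaves the buffer alone until the write completes.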
--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovh.com/us/