Re: Note about rbd_aio_write usage

Piotr Dałek <piotr.dalek@xxxxxxxxxxxx> · Thu, 6 Jul 2017 17:46:04 +0200

On 17-07-06 04:40 PM, Jason Dillaman wrote:
> On Thu, Jul 6, 2017 at 10:22 AM, Piotr Dałek <piotr.dalek@xxxxxxxxxxxx> wrote:
>> So I really see two problems here: lack of API docs and
>> backwards-incompatible change in API behavior.
>
> Docs are always in need of update, so any pull requests would be
> greatly appreciated.
>
> However, I disagree that the behavior has substantively changed -- it
> was always possible for pre-Luminous to (sometimes) copy the buffer
> before the "rbd_aio_write" method completed.

But that copy was buried somewhere deep in the librbd internals and,
looking at the Jewel version, most would assume that the buffer is not really
copied and that the user is responsible for keeping it intact until the write
is complete. An API user doesn't really care about what's going on internally;
it is beyond their control.

> With Luminous, this
> behavior is more consistent -- but in a future release memory may be
> zero-copied. If your application can properly conform to the
> (unwritten) contract that the buffers should remain unchanged, there
> would be no need for the application to pre-copy the buffers.

So far I am forced to do a copy anyway (see below). The question is whether
it's me doing it, or librbd. It doesn't make sense for both to do the same,
especially when handling tens of terabytes of data. For 10TB of data that
could mean at least 83 886 080 memory allocations, releases and copies, plus
2 684 354 560 page faults (assuming 4KB pages), and these are best-case
numbers assuming a 128KB I/O size. As I understand it, what you expect from
me is to at least double the number of memory copies and push not "just" 20TB
over the memory bus (reading 10TB from one buffer and writing those 10TB to
another), but 40TB.

In other words, if I wrote my code based on how Jewel librbd works, there
would be no real issue, apart from the fact that my program would suddenly
consume more memory and burn more CPU cycles once librbd is upgraded to
Luminous which, considering the amount of data, would be a noticeable change.
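
For anyone who wants to re-check the figures above, here is a minimal sketch of
the arithmetic. It assumes binary units (10TB = 10 * 2^40 bytes, 128KB = 2^17
bytes, 4KB = 2^12 bytes), one allocation/copy/release per I/O and one page
fault per freshly touched page. The names chunks, bus_traffic and
check_figures are made up for this illustration only.

```cpp
#include <cassert>
#include <cstdint>

// Assumptions taken from the text: binary units, 10 TB total,
// 128 KB per I/O, 4 KB pages.
constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t TiB = KiB * KiB * KiB * KiB;

// Number of chunk_size-sized pieces needed to cover `total` bytes (rounded up).
constexpr std::uint64_t chunks(std::uint64_t total, std::uint64_t chunk_size) {
    return (total + chunk_size - 1) / chunk_size;
}

// Bytes crossing the memory bus when `total` bytes are copied `copies` times:
// every copy reads the data once and writes it once.
constexpr std::uint64_t bus_traffic(std::uint64_t total, std::uint64_t copies) {
    return 2 * total * copies;
}

// Hypothetical helper that re-checks the figures quoted in the message.
inline void check_figures() {
    const std::uint64_t total = 10 * TiB;
    assert(chunks(total, 128 * KiB) == 83886080ULL);   // one alloc/copy/free per I/O
    assert(chunks(total, 4 * KiB) == 2684354560ULL);   // one fault per fresh page
    assert(bus_traffic(total, 1) == 20 * TiB);         // a single copy
    assert(bus_traffic(total, 2) == 40 * TiB);         // copy in the caller and in librbd
}
```

Calling check_figures() from any main() is enough to confirm the numbers; with
a different I/O size or page size only the arguments to chunks() change.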

> If the libfuse implementation requires that the memory is not-in-use
> by the time you return control to it (i.e. it's a synchronous API and
> you are using async methods), you will always need to copy it.

Yes, libfuse expects that once I leave the entry point, it is free to do
anything it wishes with the previously provided buffers -- and that's what it
actually does.
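
To make the constraint concrete, here is a rough, self-contained sketch of the
copy-before-return pattern. It does not use libfuse or librbd at all:
AsyncWriter, PendingWrite and demo_copy_before_return are hypothetical
stand-ins, and the "device" is just a std::string. The only point is that the
bytes are copied into storage owned by the pending write before the
synchronous entry point returns, so the caller may reuse its buffer
immediately.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Hypothetical stand-in for one queued asynchronous write (not librbd).
struct PendingWrite {
    std::size_t offset;
    std::vector<char> data;  // owned copy, valid until the write completes
};

class AsyncWriter {
public:
    // Called from the synchronous entry point. The caller may reuse `buf`
    // as soon as this returns, so the bytes are copied before queueing.
    void submit(const char* buf, std::size_t len, std::size_t offset) {
        pending_.push_back(PendingWrite{offset, std::vector<char>(buf, buf + len)});
    }

    // Simulates the writes completing at some later point in time.
    void complete_all(std::string& device) {
        while (!pending_.empty()) {
            const PendingWrite& w = pending_.front();
            if (device.size() < w.offset + w.data.size())
                device.resize(w.offset + w.data.size(), '\0');
            std::copy(w.data.begin(), w.data.end(), device.begin() + w.offset);
            pending_.pop_front();
        }
    }

private:
    std::deque<PendingWrite> pending_;
};

// The caller overwrites its buffer right after submit(), before completion.
inline std::string demo_copy_before_return() {
    AsyncWriter writer;
    std::string device;
    char buf[] = "hello";
    writer.submit(buf, 5, 0);
    std::fill(buf, buf + 5, 'X');  // buffer reused by its owner
    writer.complete_all(device);
    return device;                 // still holds the original bytes
}
```

If the library on the other side of submit() then makes its own copy as well,
every write is copied twice, which is the doubling described above.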

> The C++
> API allows you to control the copying since you need to pass
> "bufferlist"s to the API methods and since they utilize a reference
> counter, there is no internal copying within librbd / librados.

How about a hybrid solution? Keep the old rbd_aio_write contract (don't copy
the buffer, on the assumption that it won't change) and, instead of
constructing a bufferlist containing a bufferptr to the copied data, construct
a bufferlist containing a bufferptr made with create_static(user_buffer)?
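
To illustrate the difference I mean, here is a toy model. It is NOT Ceph's
bufferptr/bufferlist and does not reflect their real implementation; ToyBuffer
and copy_of are invented for this sketch, and only the name create_static is
borrowed from the proposal above. The owning variant allocates and copies; the
static variant merely records the user's pointer, so nothing is copied but the
user must keep the buffer alive and unchanged until the write completes.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy model only: not Ceph's buffer classes.
class ToyBuffer {
public:
    // Owning variant: allocates storage and copies the caller's bytes.
    static ToyBuffer copy_of(const char* src, std::size_t len) {
        ToyBuffer b;
        b.owned_ = std::make_shared<std::vector<char>>(src, src + len);
        b.ptr_ = b.owned_->data();
        b.len_ = len;
        return b;
    }

    // Non-owning variant: records the pointer only. The caller must keep
    // the memory alive and unchanged for as long as the buffer is in use.
    static ToyBuffer create_static(const char* src, std::size_t len) {
        ToyBuffer b;
        b.ptr_ = src;
        b.len_ = len;
        return b;
    }

    const char* data() const { return ptr_; }
    std::size_t size() const { return len_; }
    bool owns() const { return static_cast<bool>(owned_); }

private:
    ToyBuffer() : ptr_(nullptr), len_(0) {}
    std::shared_ptr<std::vector<char>> owned_;  // empty for static buffers
    const char* ptr_;
    std::size_t len_;
};
```

With a user buffer holding "abcd", a ToyBuffer made by copy_of() still reads
'a' after the user overwrites the first byte, while one made by
create_static() sees the new byte, because it aliases the user's memory.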

--
Piotr Dałek
piotr.dalek@xxxxxxxxxxxx
https://www.ovh.com/us/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com