On Thu, 12 Jan 2017, Piotr Dałek wrote: > On 01/11/2017 07:01 PM, Sage Weil wrote: > > On Wed, 11 Jan 2017, Jason Dillaman wrote: > > > On Wed, Jan 11, 2017 at 11:44 AM, Piotr Dałek <piotr.dalek@xxxxxxxxxxxx> > > > wrote: > > > > As the subject says - are here any users/consumers of librados C API? > > > > I'm > > > > asking because we're researching if this PR: > > > > https://github.com/ceph/ceph/pull/12216 will be actually beneficial for > > > > larger group of users. This PR adds a bunch of new APIs that perform > > > > object > > > > writes without intermediate data copy, which will reduce cpu and memory > > > > load > > > > on clients. If you're using librados C API for object writes, feel free > > > > to > > > > comment here or in the pull request. > > > +1 > > > > > > I'd be happy to tweak the internals of librbd to support pass-through > > > of C buffers all the way to librados. librbd clients like QEMU use the > > > C API and this currently results in several extra copies (in librbd > > > and librados). > > > > +1 from me too. > > > > The caveat is that we have to be very careful with buffers that are > > provided by users. Currently the userspace messenger code doesn't provide > > a way to manage the provenance of references to the buffer::raw_static > > buffers, which means that even if the write has completed, there may be > > ways for an MOSDOp to still be alive that references that memory. > > > > Either (1) we have to audit the code to be sure that by the time the > > Objecter request completes we know that all messages and their bufferlists > > are cleared (tricky/fragile), or (2) introduce some buffer management > > interface in librados so that the buffer lifecycle is independent of the > > request. I would prefer (2), but it means the interfaces would be > > something like > > > > rados_buffer_create(...) > > copy your data into that buffer > > rados_write(...) or whatever > > rados_buffer_release(...) > > > > and then rados can do the proper refcounting and only deallocate the > > memory when all refs have gone away. Unfortunately, I suspect that there > > is a largish category of users where this isn't sufficient... e.g., if > > some existing C user has its own buffer and it isn't practical to > > allocate/release via rados_buffer_* calls instead of malloc/free (or > > whatever). > > Personally I vote for (1). That way is time consuming, but may help us find > other ways to optimize resource consumption of API itself. In particular, if > librbd, as you all wrote, does several copies of user buffers, then maybe it's > worth spending some time on figuring out the actual lifetime of buffers? > Maybe, for example, buffers in MOSDOp are actually copies of already copied > buffers, so the extra copy done by librados is unnecessary? > Regardless, the idea behind my PR wasn't to make librados/librbd/libwhatever > 100% optimal right away, but to make one step towards it being efficient - > recduce cpu and memory usage a bit, have it go through testing, if nothing > breaks, try harder until we're close to perfection. > even if few percent decrease in memory consumption doesn't look interesting on > single-client level, few percents in large scale deployments may mean hundreds > of gigabytes of memory that could be put to better use. > > (2), and particularly the requirement for API changes, along with need to put > user data into specific RADOS structure doesn't look good because > - that requires way more changes to user code, and I don't expect any of > current user to be interested in that (unless we force it with API change) > - most of devs are simply lazy and instead of using "rados_buffer_create" to > put their data directly into them, they'll keep using their own buffers and > *then* copy them into RADOS buffers, just like we're doing it right now (even > if behind the scenes). Yeah, I think you're right. The bad news is just that (1) is hard. It's going to require a few things in order to address the problem of sending multiple MOSDOp requests for the same operation (e.g., after peering). I think it will mean - The Message superclass is going to require a new Mutex that is taken by the messenger implementation while accessing the encoded message payload buffer - It will also need a revoke_buffers() method (that takes the mutex) so that Objecter can yank references to buffers for any messages in flight that it no longer cares about - Objecter will need to keep a ref of the in-flight request of record. If it needs to resend to a different OSD, it can use that ref to revoke the buffers. - Once the operation completes, Objecter can drop it's ref to that Message. Until we do that, any zero-copy changes like those in the PR will mostly work but trigger use-after-free is various cases where the PG mappings are changing. sage