There is option (3) which is to have a new (or modified) "buffer::create_static" take an optional callback to invoke when the buffer::raw object is destructed. The raw pointer would be destructed when the last buffer::ptr / buffer::list containing it is destructed, so you know it's no longer being referenced. You could then have the new C API methods that wrap the C buffer in a bufferlist and set a new flag in the librados::AioCompletion to delay its completion until after it's both completed and the memory is released. When the buffer is freed, the callback would unblock the librados::AioCompltion completion callback. On Thu, Jan 12, 2017 at 8:48 AM, Sage Weil <sweil@xxxxxxxxxx> wrote: > On Thu, 12 Jan 2017, Piotr Dałek wrote: >> On 01/11/2017 07:01 PM, Sage Weil wrote: >> > On Wed, 11 Jan 2017, Jason Dillaman wrote: >> > > On Wed, Jan 11, 2017 at 11:44 AM, Piotr Dałek <piotr.dalek@xxxxxxxxxxxx> >> > > wrote: >> > > > As the subject says - are here any users/consumers of librados C API? >> > > > I'm >> > > > asking because we're researching if this PR: >> > > > https://github.com/ceph/ceph/pull/12216 will be actually beneficial for >> > > > larger group of users. This PR adds a bunch of new APIs that perform >> > > > object >> > > > writes without intermediate data copy, which will reduce cpu and memory >> > > > load >> > > > on clients. If you're using librados C API for object writes, feel free >> > > > to >> > > > comment here or in the pull request. >> > > +1 >> > > >> > > I'd be happy to tweak the internals of librbd to support pass-through >> > > of C buffers all the way to librados. librbd clients like QEMU use the >> > > C API and this currently results in several extra copies (in librbd >> > > and librados). >> > >> > +1 from me too. >> > >> > The caveat is that we have to be very careful with buffers that are >> > provided by users. Currently the userspace messenger code doesn't provide >> > a way to manage the provenance of references to the buffer::raw_static >> > buffers, which means that even if the write has completed, there may be >> > ways for an MOSDOp to still be alive that references that memory. >> > >> > Either (1) we have to audit the code to be sure that by the time the >> > Objecter request completes we know that all messages and their bufferlists >> > are cleared (tricky/fragile), or (2) introduce some buffer management >> > interface in librados so that the buffer lifecycle is independent of the >> > request. I would prefer (2), but it means the interfaces would be >> > something like >> > >> > rados_buffer_create(...) >> > copy your data into that buffer >> > rados_write(...) or whatever >> > rados_buffer_release(...) >> > >> > and then rados can do the proper refcounting and only deallocate the >> > memory when all refs have gone away. Unfortunately, I suspect that there >> > is a largish category of users where this isn't sufficient... e.g., if >> > some existing C user has its own buffer and it isn't practical to >> > allocate/release via rados_buffer_* calls instead of malloc/free (or >> > whatever). >> >> Personally I vote for (1). That way is time consuming, but may help us find >> other ways to optimize resource consumption of API itself. In particular, if >> librbd, as you all wrote, does several copies of user buffers, then maybe it's >> worth spending some time on figuring out the actual lifetime of buffers? >> Maybe, for example, buffers in MOSDOp are actually copies of already copied >> buffers, so the extra copy done by librados is unnecessary? >> Regardless, the idea behind my PR wasn't to make librados/librbd/libwhatever >> 100% optimal right away, but to make one step towards it being efficient - >> recduce cpu and memory usage a bit, have it go through testing, if nothing >> breaks, try harder until we're close to perfection. >> even if few percent decrease in memory consumption doesn't look interesting on >> single-client level, few percents in large scale deployments may mean hundreds >> of gigabytes of memory that could be put to better use. >> >> (2), and particularly the requirement for API changes, along with need to put >> user data into specific RADOS structure doesn't look good because >> - that requires way more changes to user code, and I don't expect any of >> current user to be interested in that (unless we force it with API change) >> - most of devs are simply lazy and instead of using "rados_buffer_create" to >> put their data directly into them, they'll keep using their own buffers and >> *then* copy them into RADOS buffers, just like we're doing it right now (even >> if behind the scenes). > > Yeah, I think you're right. The bad news is just that (1) is hard. It's > going to require a few things in order to address the problem of sending > multiple MOSDOp requests for the same operation (e.g., after peering). I > think it will mean > > - The Message superclass is going to require a new Mutex that is taken by > the messenger implementation while accessing the encoded message payload > buffer > > - It will also need a revoke_buffers() method (that takes the mutex) so > that Objecter can yank references to buffers for any messages in flight > that it no longer cares about > > - Objecter will need to keep a ref of the in-flight request of record. If > it needs to resend to a different OSD, it can use that ref to revoke the > buffers. > > - Once the operation completes, Objecter can drop it's ref to that > Message. > > Until we do that, any zero-copy changes like those in the PR will mostly > work but trigger use-after-free is various cases where the PG mappings are > changing. > > sage -- Jason -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html