Re: [ceph-users] Any librados C API users out there?

Sage Weil <sweil@xxxxxxxxxx> · Thu, 12 Jan 2017 13:48:26 +0000 (UTC)

On Thu, 12 Jan 2017, Piotr Dałek wrote:
> On 01/11/2017 07:01 PM, Sage Weil wrote:
> > On Wed, 11 Jan 2017, Jason Dillaman wrote:
> > > On Wed, Jan 11, 2017 at 11:44 AM, Piotr Dałek <piotr.dalek@xxxxxxxxxxxx>
> > > wrote:
> > > > As the subject says - are here any users/consumers of librados C API?
> > > > I'm
> > > > asking because we're researching if this PR:
> > > > https://github.com/ceph/ceph/pull/12216 will be actually beneficial for
> > > > larger group of users. This PR adds a bunch of new APIs that perform
> > > > object
> > > > writes without intermediate data copy, which will reduce cpu and memory
> > > > load
> > > > on clients. If you're using librados C API for object writes, feel free
> > > > to
> > > > comment here or in the pull request.
> > > +1
> > > 
> > > I'd be happy to tweak the internals of librbd to support pass-through
> > > of C buffers all the way to librados. librbd clients like QEMU use the
> > > C API and this currently results in several extra copies (in librbd
> > > and librados).
> > 
> > +1 from me too.
> > 
> > The caveat is that we have to be very careful with buffers that are
> > provided by users.  Currently the userspace messenger code doesn't provide
> > a way to manage the provenance of references to the buffer::raw_static
> > buffers, which means that even if the write has completed, there may be
> > ways for an MOSDOp to still be alive that references that memory.
> > 
> > Either (1) we have to audit the code to be sure that by the time the
> > Objecter request completes we know that all messages and their bufferlists
> > are cleared (tricky/fragile), or (2) introduce some buffer management
> > interface in librados so that the buffer lifecycle is independent of the
> > request.  I would prefer (2), but it means the interfaces would be
> > something like
> > 
> >  rados_buffer_create(...)
> >  copy your data into that buffer
> >  rados_write(...) or whatever
> >  rados_buffer_release(...)
> > 
> > and then rados can do the proper refcounting and only deallocate the
> > memory when all refs have gone away.  Unfortunately, I suspect that there
> > is a largish category of users where this isn't sufficient... e.g., if
> > some existing C user has its own buffer and it isn't practical to
> > allocate/release via rados_buffer_* calls instead of malloc/free (or
> > whatever).
> 
> Personally I vote for (1). That way is time consuming, but may help us find
> other ways to optimize resource consumption of API itself. In particular, if
> librbd, as you all wrote, does several copies of user buffers, then maybe it's
> worth spending some time on figuring out the actual lifetime of buffers?
> Maybe, for example, buffers in MOSDOp are actually copies of already copied
> buffers, so the extra copy done by librados is unnecessary?
> Regardless, the idea behind my PR wasn't to make librados/librbd/libwhatever
> 100% optimal right away, but to make one step towards it being efficient -
> recduce cpu and memory usage a bit, have it go through testing, if nothing
> breaks, try harder until we're close to perfection.
> even if few percent decrease in memory consumption doesn't look interesting on
> single-client level, few percents in large scale deployments may mean hundreds
> of gigabytes of memory that could be put to better use.
> 
> (2), and particularly the requirement for API changes, along with need to put
> user data into specific RADOS structure doesn't look good because
> - that requires way more changes to user code, and I don't expect any of
> current user to be interested in that (unless we force it with API change)
> - most of devs are simply lazy and instead of using "rados_buffer_create" to
> put their data directly into them, they'll keep using their own buffers and
> *then* copy them into RADOS buffers, just like we're doing it right now (even
> if behind the scenes).

Yeah, I think you're right.  The bad news is just that (1) is hard.  It's 
going to require a few things in order to address the problem of sending 
multiple MOSDOp requests for the same operation (e.g., after peering).  I 
think it will mean

- The Message superclass is going to require a new Mutex that is taken by 
the messenger implementation while accessing the encoded message payload 
buffer

- It will also need a revoke_buffers() method (that takes the mutex) so 
that Objecter can yank references to buffers for any messages in flight 
that it no longer cares about

- Objecter will need to keep a ref of the in-flight request of record.  If 
it needs to resend to a different OSD, it can use that ref to revoke the 
buffers.

- Once the operation completes, Objecter can drop it's ref to that 
Message.

Until we do that, any zero-copy changes like those in the PR will mostly 
work but trigger use-after-free is various cases where the PG mappings are 
changing.

sage