Re: bufferlist-free seastar osd?

kefu chai <tchaikov@xxxxxxxxx> · Mon, 10 Dec 2018 23:26:53 +0800

On Fri, Dec 7, 2018 at 2:20 AM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
>
> On Fri, Nov 30, 2018 at 7:02 AM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> >
> >
> > On 11/29/18 7:17 PM, Gregory Farnum wrote:
> > >> On Thu, Nov 29, 2018 at 1:46 PM Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> > >>> Bufferlist is a common theme in Mark's weekly performance meetings, and
> > >>> today's was no exception. Radoslaw highlighted the cost of atomic ref
> > >>> counting associated with sharing buffers, and there was broad agreement
> > >>> that the seastar osd's data path should avoid using bufferlist entirely
> > >>> for that reason.
> > >>>
> > >>> However, one choice that I made in the design of the seastar messenger
> > >>> was to use Ceph's existing bufferlist-based Message types so that their
> > >>> encode/decode wouldn't diverge between the seastar osd and its librados
> > >>> clients and peers like ceph-mon/mgr.
> > >>>
> > >>> In order to create a bufferlist-free data path, we'll need to find some
> > >>> abstraction that allows Message types to define a serialization format
> > >>> that can work both for bufferlist and for seastar's native buffer types
> > >>> - whether that covers all Message types, or just some subset of
> > >>> important ones like MOSDOp/Reply.

i think there are two problems in raw_seastar_foreign_ptr and bufferlist is

1. internal fragments
2. (atomic) refcount
3. memcpy

let's discuss them one after another, using serving a write request in
crimson-osd as a use case:

-  internal fragments:
  the goal of eliminating internal fragments is to ensure that the
life cycle of the allocated buffer and the used chunk in it is
(almost) the same. underneath of input_stream::consume(),
native_connected_socket_impl::get() is called to read a
temporary_buffer<> from TCP socket, each time
native_connected_socket_impl::get() gets called, it grab a
temporary_buffer<> from a `packet`, in which an array of char* is
maintained. and all temporary_buffers sharing the same packet hold a
strong references of the packet. put in another way, the packet is
refcounted. so if we read the packet in a *right rhythm*, we can avoid
the most of the internal fragments. for instance, if we are about to
read the payload sizing 4KiB which will be sent to disk, we should
read them all at once in a separated call. or, if we cannot expect the
life cycle of memory chunk in payload when reading it, but we end up
finding that only 40 bytes out of a 4MiB packet will be sent to disk,
we should consider deep copy the 40 bytes in order to return the 4MiB
memory to system.

- (atomic) refcount
currently we use a subclass of ceph::buffer::raw to encapsulate
seastar::temporary_buffer -- raw_seastar_foreign_ptr.  so we have two
refcounting in action here:
1. ceph::buffer::raw::nref: as raw_seastar_foreign_ptr is a subclass
of ceph::buffer::raw, so it is refcounted.
2. seastar::deleter::impl::nref: seastar is using a chained
polymorphic deleter for managing the life cycle of temporary_buffer.
and like ptr, all deleters are derived from seastar::deleter::impl,
which is refcounted.
we could develop a native data structure to manage an array
seastar::temporary_buffer directly, without the extra
ceph::buffer::raw layer. we could name it ceph::buffers. and to along
with the denc framework, ceph::buffers should fulfill the concept of
two set of accessors:
1. contiguous accessor: ensuring that the underlying temporay_buffer
is large enough to hold the whole object to be decoded
2. non-contiguous accessor: we cannot guarantee that the underlying
temporay_buffer is large enough to hold the whole encoded object

i think we can have two different types for them to make sure we can
work with contiguous buffer and non-contiguous buffer read from wire.
they are (apparently):

1. ceph::buffers
2. seastar::temporary_buffer

and we need to

1. add two iterator types for accessing them
2. add denc implementation:
  * encode(), decode() for all primitive types on top of which we can
rely on the abstraction of denc() to access them like Greg pointed
out. or
  * generalize get_pos_add(), and add a copy_buffers() so it works
with buffer::list::const_iterator and ceph::buffers::const_iterator

- memcpy
  i think, by reading the right size of buffer in the right rhythm, we
will suffer less from memcpy. we can even preallocate a
seastar::buffers with the right padding for holding a write request to
minimize the overhead of block-aligned required by direct io when
writing to disk.

> > >>>
> > >>> Any thoughts on what this could look like?
> > >>>
> > >>> Casey
> > > On Thu, Nov 29, 2018 at 2:32 PM Noah Watkins <nwatkins@xxxxxxxxxx> wrote:
> > >> Potentially stupid question... if there exists a buffer container that
> > >> can work in the new data path _without reference counting_, why not
> > >> create a bufferlist super class that doesn't do reference counting?
> > > Or even just let seastar wrap its own buffers in a new buffer ptr type
> > > and generate a new bufferlist when it passes that data into the
> > > non-seastar-ized code?
> > > -Greg
> >
> > Yeah, that's essentially what the messenger does now - it reads
> > seastar's native buffers off the network, wraps them in a buffer::raw
> > type, and passes them as bufferlists to ::decode_message(). There's a
> > similar conversion from bufferlist to seastar::net::packet for the
> > send() side.
> >
> > So the challenge is to find a way to decode messages from their native
> > buffers, avoiding conversions to bufferlist until we need to pass them
> > to common code that expects bufferlist. Ideally, the data path won't
> > require any of these conversions.
>
> So while the literal encode/decode functions in a given Message are
> fixed type, the actual logic of them pretty friendly to being swapped
> out with macros thanks to ENCODE_START etc. I haven’t thought about it
> deeply but I presume it would be pretty simple to extend the message
> types to add seastar-buffer encodes with different buffer types but
> otherwise the same signature, and to compile the individual Message’s
> encode/decode functions with both as either macros or templates?
>
> Then for the rare message types that need to convert their underlying
> data into the other buffer format, we can do that either explicitly at
> the handoff boundaries or implicitly when a function asks for the data
> in a particular format.
>
> Am I missing something that makes this particularly difficult?
> -Greg

-- 
Regards
Kefu Chai