Re: Ceph Messaging on Accelio (libxio) RDMA

Hi Sage,

Replies inline.

----- "Sage Weil" <sage@xxxxxxxxxxx> wrote:

> Hi Matt,
> 
> Thanks for posting this!  Some comments and questions below.
> 
> 
> I was originally thinking that xio was going to be more
> Mellanox-specific, but it looks like it runs over multiple transports
> (even TCP!).  (I'm sure I've been told this before, but it apparently
> didn't sink in.)  Is there also a Mellanox-specific backend (that is
> not ibverbs) that takes any special advantage of Mellanox hw
> capabilities?

The actual situation is that xio is currently ibverbs-specific, though
there is interest from Mellanox and some partners in building a TCP
transport for it.

What is true is that xio makes very advanced use of the ibverbs
interfaces, lock-free/wait-free allocators, and rdtsc, but hides a lot
of these details from upper layers.  The xio designers knew how to get
the most from InfiniBand/RDMA, and it shows.

Also, ibverbs is a first-class interface to iWARP and especially RoCE
hardware, as well as InfiniBand.  I've been doing most of my
development on a tweaked version of the softiwarp ibverbs provider,
which amounts to a full RDMA simulator that runs on anything.
(Apparently it can run over TCP, but I just use it on one VM host.)

I haven't worked with CCI, but just glancing at it, I'd say xio stacks
up very well on ibverbs; it won't, however, solve the TCP transport
problem immediately.

> 
> Similarly, are there other projects or vendors that are looking at
> xio at this point?

Mainly Mellanox partners are working with it, I believe.

> I've seen similar attempts to create this sort of library (CCI comes
> to mind: https://github.com/CCI/cci).  Have these previous attempts
> influenced the design of xio at all?

> > 
> > The approach I took in incorporating Accelio was to build on the
> > key abstractions of Messenger, Connection, Dispatcher, and Message,
> > and build a corresponding family of concrete classes:
> 
> This sounds like the right approach.  And we definitely want to clean
> up the separation of the abstract interfaces (Message, Connection,
> Messenger) from the implementations.  I'm happy to pull that stuff
> into the tree quickly once the interfaces appear stable (although it
> looks like your branch is based off lots of other linuxbox bits, so
> it probably isn't important until this gets closer to ready).

Ok, cool.
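For anyone following along, the abstraction split under discussion might look roughly like the sketch below.  The class shapes are hypothetical and much simplified (the loopback "transport" in XioConnection stands in for a real Accelio session); they are not Ceph's actual declarations, just an illustration of keeping Messenger/Connection/Dispatcher abstract with one concrete family per transport:

```cpp
#include <memory>
#include <string>

// Hypothetical, much-simplified shapes of the interfaces discussed
// above -- not Ceph's actual definitions.
struct Message {
  std::string payload;
};

struct Dispatcher {
  virtual ~Dispatcher() = default;
  // Mirrors the shape of Ceph's Dispatcher::ms_dispatch.
  virtual bool ms_dispatch(Message& m) = 0;
};

struct Connection {
  virtual ~Connection() = default;
  virtual void send_message(const Message& m) = 0;
};

struct Messenger {
  virtual ~Messenger() = default;
  virtual std::shared_ptr<Connection> connect(const std::string& addr) = 0;
  void set_dispatcher(Dispatcher* d) { dispatcher = d; }
protected:
  Dispatcher* dispatcher = nullptr;
};

// One concrete family per transport: an XioMessenger would wrap
// Accelio sessions the way SimpleMessenger wraps its Pipes.  Here the
// "transport" just loops a copy of the message back to the dispatcher.
struct XioConnection : Connection {
  Dispatcher* dispatcher;
  explicit XioConnection(Dispatcher* d) : dispatcher(d) {}
  void send_message(const Message& m) override {
    Message copy = m;
    if (dispatcher) dispatcher->ms_dispatch(copy);
  }
};

struct XioMessenger : Messenger {
  std::shared_ptr<Connection> connect(const std::string&) override {
    return std::make_shared<XioConnection>(dispatcher);
  }
};

// Trivial dispatcher used only for illustration.
struct CountingDispatcher : Dispatcher {
  int handled = 0;
  bool ms_dispatch(Message&) override { ++handled; return true; }
};
```

The point of the split is that calling code only ever sees the three abstract types, so SimpleMessenger and XioMessenger can coexist behind the same interface.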

> 
> Also, it would be great to build out the simple_* test endpoints as
> this effort progresses; hopefully that can eventually form the basis
> of a test suite for the messenger and can be expanded to include
> various stress tests that don't require a full running cluster.

I agree.  I intend to have it exercising more Message types RSN.
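As a sketch of what such a test endpoint's core loop might look like: the SimpleDispatcher shape and the in-memory queue below are illustrative stand-ins (assumed, not the actual simple_server code), showing a loop that drains incoming messages and hands each to the dispatcher:

```cpp
#include <deque>
#include <string>

// Hypothetical sketch of a SimpleDispatcher-style test endpoint's
// dispatch loop; names are illustrative, not Ceph's actual code.
struct Message {
  std::string type;
  std::string payload;
};

struct SimpleDispatcher {
  int handled = 0;
  // Mirrors the shape of Ceph's Dispatcher::ms_dispatch: return true
  // if this dispatcher consumed the message.
  bool ms_dispatch(Message& m) {
    ++handled;
    return true;
  }
};

// Drain a queue of incoming messages, handing each to the dispatcher;
// returns the number of messages consumed.  A real endpoint would
// block on the messenger rather than poll an in-memory deque.
inline int run_dispatch_loop(std::deque<Message>& in, SimpleDispatcher& d) {
  int n = 0;
  while (!in.empty()) {
    Message m = in.front();
    in.pop_front();
    if (d.ms_dispatch(m)) ++n;
  }
  return n;
}
```

A stress test would then just be a matter of filling the queue with many message types and checking the dispatcher's counts, with no cluster required.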

> 
> > XioMessenger (concrete, implements Messenger, encapsulates xio
> > endpoints, aggregates

Agreed, I respond to this point in more detail in my reply to Greg's
message.

> 
> This worries me a bit; see Greg's questions.  There are several
> request/reply patterns, but many (most?) of the message exchanges are
> asymmetrical.  I wonder if the low-level request/reply model really
> maps more closely to the 'ack' stuff in SimpleMessenger (as it's
> about deallocating the sender's memory and cleaning up rdma state).
> 
> > A lot of low level details of the mapping from Message to Accelio
> > messaging are currently in flux, but the basic idea is to re-use
> > the current encode/decode primitives as far as possible, while
> > eliding the acks, sequence #s and tids, and timestamp behaviors of
> > Pipe, or rather, replacing them with mappings to Accelio
> > primitives.  I have some wrapper classes that help with this.  For
> > the moment, the existing Ceph message headers and footers are still
> > there, but are now encoded/decoded, rather than hand-marshalled.
> > This means that checksumming is probably mostly intact.  Message
> > signatures are not implemented.
> > 
> > What works: the current prototype isn't integrated with the main
> > server daemons (e.g., OSD), but experimental work on that is in
> > progress.  I've created a pair of simple standalone client/server
> > applications, simple_server/simple_client, and a matching
> > xio_server/xio_client, that provide a minimal message dispatch loop
> > with a new SimpleDispatcher class and some other helpers, as a way
> > to work with both messengers side-by-side.  These are currently
> > very primitive, but will probably do more things soon.  The current
> > prototype sends messages over Accelio, but has some issues with
> > replies that should be fixed shortly.  It leaks lots of memory,
> > etc.
> > 
> > We've pushed a work-in-progress branch "xio-messenger" to our
> > external github repository, for community review.  Find it here:
> > 
> > https://github.com/linuxbox2/linuxbox-ceph
> 
> Looking through this, it occurs to me that there are some other
> foundational pieces that we'll need to get in place soon:
> 
>  - The XioMessenger is a completely different wire protocol that
> needs to be distinct from the legacy protocol.  Probably we can use
> the entity_addr_t::type field for this.
>  - We'll want the various *Map structures to allow multiple
> entity_addr_t's per entity.  We already could use this to support
> both IPv4 and IPv6.  In the future, though, we'll probably want
> clusters that can speak both the legacy TCP protocol (via
> SimpleMessenger or some improved implementation) and the xio one (and
> whatever else we dream up in the future).

Ack.
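For concreteness, the two pieces above might combine along these lines.  This is a hypothetical sketch only: the enum values, field layout, and helper are invented for illustration, and the string address stands in for the real sockaddr storage:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative sketch: a protocol tag on the address, plus per-entity
// address vectors.  Values and layout are assumptions, not Ceph's.
enum class addr_type : uint32_t {
  legacy = 1,  // SimpleMessenger TCP wire protocol
  xio    = 2,  // Accelio/RDMA wire protocol
};

struct entity_addr_t {
  addr_type type;
  std::string addr;  // stand-in for the real sockaddr storage
};

// A *Map entry could hold several addresses for one entity, so a
// cluster can advertise both a legacy TCP and an xio endpoint (or
// IPv4 and IPv6 side by side).
using addr_vec = std::vector<entity_addr_t>;

// Pick the first address matching the protocol the caller can speak;
// nullptr lets the caller fall back to another protocol.
inline const entity_addr_t* pick_addr(const addr_vec& addrs,
                                      addr_type want) {
  for (const auto& a : addrs)
    if (a.type == want) return &a;
  return nullptr;
}
```

An xio-capable peer would ask for the xio address first and fall back to legacy, while old clients would only ever look for the legacy type.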

> 
> Also, as has been mentioned previously,
> 
>  - We need to continue to migrate stuff over to the Connection-based
> Messenger interface and off the original methods that take
> entity_inst_t.  The sticky bit here is the peer-to-peer mode that is
> used inside the OSD and MDS clusters: those need to handle racing
> connection attempts, which either requires the internal entity name
> -> connection map to resolve races (as we have now) or a new approach
> that pushes the race-resolution up into the calling code (meh).  No
> need to address it now, but eventually we'll need to tackle it before
> this can be used on the osd back-side network.
> 
> sage

-- 
Matt Benjamin
CohortFS, LLC.
206 South Fifth Ave. Suite 150
Ann Arbor, MI  48104

http://cohortfs.com

tel.  734-761-4689 
fax.  734-769-8938 
cel.  734-216-5309 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html