To shed more light on Accelio, see some notes below.

Yaron

> -----Original Message-----
> From: Matt W. Benjamin [mailto:matt@xxxxxxxxxxxx]
> Sent: Thursday, December 12, 2013 3:34 AM
> To: Sage Weil
> Cc: ceph-devel; Yaron Haviv; Eyal Salomon
> Subject: Re: Ceph Messaging on Accelio (libxio) RDMA
>
> Hi Sage,
>
> inline
>
> ----- "Sage Weil" <sage@xxxxxxxxxxx> wrote:
>
> > Hi Matt,
> >
> > Thanks for posting this! Some comments and questions below.
> >
> > I was originally thinking that xio was going to be more
> > mellanox-specific, but it looks like it runs over multiple transports
> > (even tcp!). (I'm sure I've been told this before but it apparently
> > didn't sink in.) Is there also a mellanox-specific backend (that is
> > not ibverbs) that takes any special advantage of mellanox hw
> > capabilities?
>
> [YH> ] Note that Accelio is hardware independent: it works over
>        different RDMA transports (IB, RoCE, iWarp, ...) and will add
>        non-RDMA transports. It is entirely open source and contains
>        contributions from multiple vendors, see:
>        https://github.com/accelio/accelio
>        A variety of code examples is in:
>        https://github.com/accelio/accelio/tree/master/examples/usr
>        There are many cool transport optimizations and advanced
>        functionality built into it, but they are abstracted from the
>        end user, allowing best performance with rapid development.
>        See more details in:
>        http://www.accelio.org/wp-content/themes/pyramid_child/pdf/WP_Accelio_OpenSource_IO_Message_and_RPC_Acceleration_Library.pdf
>
> The actual situation is that xio is currently ibverbs specific, though there is
> interest with Mellanox and some partners in building a TCP transport for it.
>
> [YH> ] TCP will be added early next year. The transport
>        abstraction/plug-in mechanism is already implemented; if someone
>        wants to help with that, we are open to it :)
>
> What is true is that xio makes very advanced use of ibverbs interfaces,
> lock-free/wait-free allocators, rdtsc, but hides a lot of details from upper layers.
> The xio designers knew how to get the most from infiniband/RDMA, and it
> shows.
>
> [YH> ] Note that Accelio is faster than using ibverbs directly, since
>        it applies many optimizations in the way it uses the API, e.g.
>        amortizing HW calls, avoiding locks, avoiding memory coherency
>        and locality issues, etc. Today we can get ~1.5M TP/s (Req+Rep)
>        per thread, and many million TP/s with multiple threads.
>
> Also, ibverbs is a first-class interface to iWARP and esp.
> ROCE hardware, as well as ib. I've been doing most of my development on a
> tweaked version of the softiwarp ib provider, which amounts to a full RDMA
> simulator that runs on anything. (Apparently it can run over TCP, but I just
> use it on one vm host.)
>
> I haven't worked with cci, but just glancing at it, I'd say xio stacks up very well
> on ibverbs, but won't solve the TCP transport problem immediately.
>
> > Similarly, are there other projects or vendors that are looking at xio
> > at this point?
>
> Mellanox partners are working with it mainly, I believe.
>
> [YH> ] Several open source projects will incorporate Accelio as
>        middleware (e.g. HDFS), many storage/database vendors are
>        adopting it, and a variety of end users plan to use the C or
>        Java APIs. (Java Accelio performs like the C code, with 1.5M
>        TP/s per thread, since all the transport is in hardware and it
>        doesn't involve context switches or locks.)
>
> > I've seen similar attempts to create this sort of library
> > (CCI comes to mind: https://github.com/CCI/cci). Have these previous
> > attempts influenced the design of xio at all?
> >
> > > The approach I took in incorporating Accelio was to build on the key
> > > abstractions of Messenger, Connection, and Dispatcher, and Message,
> > > and build a corresponding family of concrete classes:
> >
> > This sounds like the right approach. And we definitely want to clean
> > up the separation of the abstract interfaces (Message, Connection,
> > Messenger) from the implementations.
> > I'm happy to pull that stuff into the tree quickly once the interfaces
> > appear stable (although it looks like your branch is based off lots of
> > other linuxbox bits, so it probably isn't important until this gets
> > closer to ready).
>
> Ok, cool.
>
> > Also, it would be great to build out the simple_* test endpoints as
> > this effort progresses; hopefully that can eventually form the basis
> > of a test suite for the messenger and can be expanded to include
> > various stress tests that don't require a full running cluster.
>
> I agree. I intend to have it at least running more Message types RSN.
>
> > > XioMessenger (concrete, implements Messenger, encapsulates xio
> > > endpoints, aggregates
>
> Agreed, I respond to this point in more detail in my reply to Greg's message.
>
> > This worries me a bit; see Greg's questions. There are several
> > request/reply patterns, but many (most?) of the message exchanges are
> > asymmetrical. I wonder if the low-level request/reply model really
> > maps more closely the 'ack' stuff in SimpleMessenger (as it's about
> > deallocating the sender's memory and cleaning up rdma state).
> >
> > > A lot of low level details of the mapping from Message to Accelio
> > > messaging are currently in flux, but the basic idea is to re-use the
> > > current encode/decode primitives as far as possible, while eliding
> > > the acks, sequence # and tids, and timestamp behaviors of Pipe, or
> > > rather, replacing them with mappings to Accelio primitives. I have
> > > some wrapper classes that help with this. For the moment, the
> > > existing Ceph message headers and footers are still there, but are
> > > now encoded/decoded, rather than hand-marshalled. This means that
> > > checksumming is probably mostly intact. Message signatures are not
> > > implemented.
> > >
> > > What works.
> > > The current prototype isn't integrated with the main server daemons
> > > (e.g., OSD) but experimental work on that is in progress. I've
> > > created a pair of simple standalone client/server applications
> > > simple_server/simple_client and a matching xio_server/xio_client,
> > > that provide a minimal message dispatch loop with a new
> > > SimpleDispatcher class and some other helpers, as a way to work with
> > > both messengers side-by-side. These are currently very primitive, but
> > > will probably do more things soon. The current prototype sends
> > > messages over Accelio, but has some issue with replies, that should
> > > be fixed shortly. It leaks lots of memory, etc.
> > >
> > > We've pushed a work-in-progress branch "xio-messenger" to our
> > > external github repository, for community review. Find it here:
> > >
> > > https://github.com/linuxbox2/linuxbox-ceph
> >
> > Looking through this, it occurs to me that there are some other
> > foundational pieces that we'll need to get in place soon:
> >
> > - The XioMessenger is a completely different wire protocol that needs
> >   to be distinct from the legacy protocol. Probably we can use the
> >   entity_addr_t::type field for this.
> > - We'll want the various *Map structures to allow multiple
> >   entity_addr_t's per entity. We already could use this to support both
> >   IPv4 and IPv6. In the future, though, we'll probably want clusters
> >   that can speak both the legacy TCP protocol (via SimpleMessenger or
> >   some improved implementation) and the xio one (and whatever else we
> >   dream up in the future).
>
> Ack.
>
> > Also, as has been mentioned previously,
> >
> > - We need to continue to migrate stuff over to the Connection-based
> >   Messenger interface and off the original methods that take
> >   entity_inst_t.
> >   The sticky bit here is the peer-to-peer mode that is used inside the
> >   OSD and MDS clusters: those need to handle racing connection attempts,
> >   which either requires the internal entity name -> connection map to
> >   resolve races (as we have now) or a new approach that pushes the
> >   race-resolution up into the calling code (meh). No need to address it
> >   now, but eventually we'll need to tackle it before this can be used on
> >   the osd back-side network.
> >
> > sage
>
> --
> Matt Benjamin
> CohortFS, LLC.
> 206 South Fifth Ave. Suite 150
> Ann Arbor, MI 48104
>
> http://cohortfs.com
>
> tel. 734-761-4689
> fax. 734-769-8938
> cel. 734-216-5309
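
For readers following the point about racing connection attempts in the peer-to-peer case, here is a minimal C++ sketch of the first option mentioned in the thread: an entity name -> connection map that resolves the race itself. This is not Ceph's SimpleMessenger code. The names ConnTable, Conn and Initiator are hypothetical, and the tie-break rule (the connection opened by the entity with the lexicographically smaller name survives) is an assumption chosen only for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical names throughout; this is not Ceph's SimpleMessenger code.
enum class Initiator { Local, Peer };

struct Conn {
    Initiator initiator;  // which side opened this connection
    int id;               // illustrative handle for the underlying session
};

class ConnTable {
public:
    explicit ConnTable(std::string local_name) : local_(std::move(local_name)) {}

    // Record a connection to `peer` and return the id of the survivor.
    // Assumed tie-break: the connection opened by the entity with the
    // lexicographically smaller name wins. Both peers evaluate the same
    // rule, so each side can decide locally which connection to keep.
    int add(const std::string& peer, Conn c) {
        assert(peer != local_ && "an entity does not connect to itself here");
        const Initiator preferred =
            (local_ < peer) ? Initiator::Local : Initiator::Peer;
        auto it = conns_.find(peer);
        if (it == conns_.end()) {
            conns_.emplace(peer, c);   // no race: first connection seen
            return c.id;
        }
        Conn& existing = it->second;
        if (existing.initiator != preferred && c.initiator == preferred) {
            existing = c;              // newcomer wins the race
        }
        return existing.id;            // otherwise the existing one stays
    }

private:
    std::string local_;
    std::map<std::string, Conn> conns_;
};
```

If osd.1 and osd.2 dial each other at the same moment, each side registers both its outgoing and its incoming connection in its own table, and under the assumed rule both tables end up holding the connection that osd.1 opened, regardless of the order in which the two were registered. The alternative raised in the thread, pushing race resolution up into the calling code, would move this comparison out of the map and into each caller.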