Hi Bart and Sagi,

Thanks for the warm welcome and the early feedback. I will respond to both of you here, on Jack's email, since I am not on CC of the first cover letter (what a bummer). Sorry for the mess.

Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
> - Is there room for this ibnbd? If we were to take every block driver
>   that was submitted without sufficient justification, it'd be very
>   hard to maintain. What advantage (if any) does this buy anyone over
>   existing rdma based protocols (srp, iser, nvmf)? I'm really (*really*)
>   not sold on this one...

It seems better to start with the history. The IBNBD project, as it was presented, is not only the block device (which is supposed to be thin); mainly it is an RDMA transport with client and server logic, which we call IBTRS. IBTRS was the starting point and the main idea for us, and it is planned to bind our own replicated storage solution over InfiniBand. We wanted a clean transport interface, which should not be very different from what normal BSD sockets provide, e.g.:

  ibtrs_clt_open()               - Opens a session to a server.
  ibtrs_clt_rdma_write()         - WRITE, i.e. transfer data to the server.
  ibtrs_clt_request_rdma_write() - READ, i.e. request a data transfer from
                                   the server.
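To make the socket-like flavour of the interface concrete, here is a minimal sketch of how a client is intended to drive these three calls. The prototypes below (session handle type, parameters, return conventions) are assumptions for illustration only, not the actual IBTRS declarations:

  #include <stddef.h>

  /* Opaque client session handle (hypothetical). */
  struct ibtrs_session;

  /* Hypothetical prototypes mirroring the three calls named above;
   * the real declarations will differ. */
  struct ibtrs_session *ibtrs_clt_open(const char *srv_addr, unsigned short port);
  int ibtrs_clt_rdma_write(struct ibtrs_session *sess, const void *buf, size_t len);
  int ibtrs_clt_request_rdma_write(struct ibtrs_session *sess, void *buf, size_t len);

  /* Open a session, push one buffer to the server (WRITE), then ask the
   * server to transfer data back into a local buffer (READ). */
  static int example_io(const char *srv_addr, void *rx, const void *tx, size_t len)
  {
          struct ibtrs_session *sess = ibtrs_clt_open(srv_addr, 1234);

          if (!sess)
                  return -1;
          if (ibtrs_clt_rdma_write(sess, tx, len))
                  return -1;
          return ibtrs_clt_request_rdma_write(sess, rx, len);
  }

No SCSI (or any other storage) semantics appear in that interface, which is exactly the point of the next paragraph.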
We did not want to rely on, depend on, or embed any existing command sets or protocols (e.g. SCSI) inside our transport. IBTRS should stay free of any storage knowledge and should be able to do only two things:

 1. establish a connection to a server (accept connections from clients)
 2. read/write any data

Thinking of the transport as a layer, IBNBD is just one user of that layer: a thin block device with only 4 commands, which establishes one connection to the server and maps N block devices through that connection.

I realize quite well that I have described obvious things with this layering, but again, what we wanted and achieved is an independent IB transport, which is planned to be used for replication, and in that project IBNBD will not exist.

> - To me, the fundamental design that the client side owns a pool of
>   buffers that it issues writes too, seems inferior than the
>   one taken in iser/nvmf (immediate data).

That is of course debatable, since we consider it a feature :) But indeed, we have never tested against nvmf, and of course I am open to any surprises.

> IMO, the ibnbd design has
> scalability issues both in terms of server side resources,

Each client connection eats ~64MB of memory on the server side for the IO buffer pool. Since we are talking about hundreds of connections, this is a reasonable trade-off (memory is fairly cheap) to avoid any allocations on the IO path and to keep it completely lockless.

> client side contention

We do not see any noticeable contention. What we have is a single queue of buffers per connection (session) for all devices mapped on a client, and I consider that a single point of resource, where each device has to fight to get a free bit from a bitmap. In practice, all the benchmarks we did (I have to say this was on a pretty old 4.4 kernel, and fresh numbers are work that obviously still has to be done) show that even though we share one queue, the bottleneck is always the InfiniBand link.

> and network congestion (on infiniband the latter is
> less severe).

We rely on IB flow control: if the server side has not replenished its receive buffers with ib_post_recv(), we do not reach the wire from the client. That should be enough.

> - I honestly do not understand why you need *19057* LOC to implement
>   a rdma based block driver. Thats almost larger than all of our
>   existing block drivers combined...

Oh, those LOC numbers :) The IBNBD client (block device) itself is around 2800 lines, including all the handy sysfs bits. What is really bloated is the transport client and server sides, which try to be smart and cover:

 o reconnects
 o heartbeats
 o a quite large FSM implementation for each connection (session)

True, this code can be split into common parts and deflated.

> First glance at the code provides
> some explanations,
> (1) you have some strange code that has no business
> in a block driver like ibtrs_malloc/ibtrs_zalloc (yikes)

Indeed, that is crap left over from the times when we tried to wrap all generic kernel calls in order to do fault injection.

> open-coding various existing logging routines,

No excuses, that will be removed.

> (2) you are for some
> reason adding a second tag allocation scheme (why?)

Several reasons:

 a) We share a single queue of buffers for all mapped devices per client
    connection, and we still support RQ mode.

 b) MQ shared tags are indeed shared, but not between different hctxs.
    I.e. because of our single queue for N devices per connection, we are
    not able to create M hctxs and share that single queue between them.
    Sometimes we need to call blk_mq_stop_hw_queue() to stop the hw queue
    and then restart it from a completion.

Obviously, the simplest approach would be to split a static number of buffers in the queue equally between the N devices, but that does not work: with a big N we end up with a very small queue depth per device. Of course, it would be nice if, along with BLK_MQ_F_TAG_SHARED, another flag existed, say BLK_MQ_F_TAG_GLOBALLY_SHARED. Unfortunately, with our design MQ does not help us much with the current shared tags implementation.
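For clarity, the "second tag allocation scheme" boils down to something like the following: one bitmap of server-side IO buffers per session, shared globally by every device mapped over that session. This is a rough sketch with made-up names, not the actual IBNBD code:

  #include <linux/bitmap.h>
  #include <linux/wait.h>

  struct sess_tags {
          unsigned long     *bits;  /* one bit per server-side IO buffer   */
          unsigned int      nr;     /* total buffers in the session pool   */
          wait_queue_head_t wq;     /* woken when a buffer is released     */
  };

  static int sess_tag_get(struct sess_tags *t)
  {
          unsigned long bit;

          do {
                  bit = find_first_zero_bit(t->bits, t->nr);
                  if (bit >= t->nr)
                          return -1;  /* pool empty: caller stops the hw queue */
          } while (test_and_set_bit(bit, t->bits));

          return bit;
  }

  static void sess_tag_put(struct sess_tags *t, unsigned int bit)
  {
          clear_bit(bit, t->bits);
          wake_up(&t->wq);  /* lets a stopped hw queue be restarted */
  }

Since every hctx of every device on the session allocates from this one pool, per-device shared tags as provided by BLK_MQ_F_TAG_SHARED do not map onto this model.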
> (3) you are open
> coding a lot of stuff that we added to the stack in the past months...

Could you please point precisely at the code you mean? If you are saying that a lot was added to nvmf, I would like to take a pause and postpone the nvmf discussion until we have fresh numbers and a better understanding.

> (4) you seem to over-layer your code for reasons that I do not
> really understand.

Frankly, I did not get this one. "Over-layered code" sounds very abstract to me here.

> And I didn't really look deep at all into the
> code, just to get the feel of it, and it seems like it needs a lot
> of work before it can even be considered upstream ready.

Definitely.

Bart Van Assche <Bart.VanAssche@xxxxxxxxxxx> wrote:
> * Doesn't scale in terms of number of CPUs submitting I/O. The graphs shown
>   during the Vault talk clearly illustrate this. This is probably the result
>   of sharing a data structure across all client CPUs, maybe the bitmap that
>   tracks which parts of the target buffer space are in use.

Probably that was not clear from the presentation, but that is exactly what was fixed by CPU affinity (the next slide after those hapless graphs). What remains unclear to us is the NUMA effect, which is not related to the distances reported by numactl. Still, your advice to run everything under perf and check cache misses is more than valid. Thanks.

> * Supports IB but none of the other RDMA transports (RoCE / iWARP).

We did a lot of experiments with Soft-RoCE in a virtualized environment in order to get quick test results on one host without hardware. That works quite well (despite some Soft-RoCE bugs, which were fixed locally), so it should not be a big deal to cover iWARP as well.

> We also need performance numbers that compare IBNBD against SRP and/or
> NVMeOF with memory registration disabled to see whether and how much faster
> IBNBD is compared to these two protocols.

Indeed, we have never tested against nvmf, and that is a must. Thanks.

--
Roman