Re: Congestion window or other reason?

Scott Atchley <atchley@xxxxxxxx> · Fri, 26 Sep 2008 16:33:23 -0400

On Sep 26, 2008, at 4:08 PM, Talpey, Thomas wrote:

I'd love to hear more about RPCMX! What is it?

It is based on the RPCRDMA code using MX. MX is Myricom's
second-generation zero-copy, kernel-bypass API (GM was the first).
Unlike IB/iWarp, MX provides only a two-sided interface (send/recv)
and is closely modeled after MPI-1 semantics.

I wrote the MX ports for Lustre and PVFS2. I am finding this to be  
more challenging than either of those.

The congestion window is all about the number of concurrent RPC
requests, and isn't dependent on the number of segments or even size
of each message. Congestion is a client-side thing, the server never
delays its replies.
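
To make the point above concrete, here is a minimal sketch of a client-side window that counts concurrent requests rather than bytes. It is not the kernel's RPC code; the class name RequestWindow and its methods are made up for illustration, and the only assumption is that each outstanding request costs exactly one slot regardless of its size.

```python
from collections import deque


class RequestWindow:
    """Hypothetical client-side window limiting concurrent RPC requests."""

    def __init__(self, cwnd):
        self.cwnd = cwnd          # maximum requests allowed in flight
        self.in_flight = 0        # requests sent but not yet answered
        self.backlog = deque()    # request ids waiting for a free slot

    def submit(self, xid, nbytes):
        # nbytes is deliberately ignored: a 512-byte request and a
        # 1 MB request each consume exactly one slot.
        if self.in_flight < self.cwnd:
            self.in_flight += 1
            return True           # caller may transmit now
        self.backlog.append(xid)
        return False              # caller must wait for a reply

    def complete(self):
        # A reply arrived: free its slot and hand it to one waiter, if any.
        self.in_flight -= 1
        if self.backlog:
            self.in_flight += 1
            return self.backlog.popleft()
        return None
```

With a window of 2, a third submit() is queued no matter how small it is, and it is released only when complete() is called for an earlier request. The server plays no part in the decision.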

Interesting. The client does not have a global view, unfortunately,  
and has no idea how busy the server is (i.e. how many other clients it  
is servicing).

The RPC/RDMA code uses the congestion window to manage its flow
control window with the server. There is a second, somewhat hidden
congestion window that the RDMA adapters use between one another for
RDMA Read requests, the IRD/ORD. But those aren't visible outside the
lowest layer.
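
As a rough illustration of using the congestion window as a flow control window, the sketch below assumes the server advertises a credit count in each reply (the RPC-over-RDMA header carries such a credit field) and that the client simply adopts the granted value as its new window. The class and method names are hypothetical, and the adapter-level IRD/ORD limit is not modeled at all.

```python
class CreditWindow:
    """Hypothetical client window resized by server-granted credits."""

    def __init__(self, initial_credits=1):
        self.cwnd = initial_credits   # requests the server will accept
        self.in_flight = 0            # requests awaiting a reply

    def can_send(self):
        return self.in_flight < self.cwnd

    def on_send(self):
        if not self.can_send():
            raise RuntimeError("no credits available")
        self.in_flight += 1

    def on_reply(self, granted_credits):
        # Each reply frees one slot and carries the server's current
        # grant; never drop below 1 so the client can always make progress.
        self.in_flight -= 1
        self.cwnd = max(1, granted_credits)
```

Starting from one credit, the client can send a single request; once the first reply grants, say, 32 credits, up to 32 requests may be outstanding.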

Is this due to the fact that IB uses queue pairs (QP) and one peer  
cannot send a message to another unless a slot is available in the QP?  
If so, we do not have this limitation in MX (no QPs).

I would be surprised if you can manage hundreds of pages times dozens
of active requests without some significant resource issues at the
server. Perhaps your problems are related to those?

Tom.

In Lustre and PVFS2, the network MTU is 1 MB (and optionally 4 MB in
PVFS2). We do not have issues in MX scaling to hundreds or thousands
of peers (again no QPs). As for handling a few hundred MBs from a few
hundred clients, it should be no problem. Whether the filesystem
back-end can handle it is another question.

When using TCP with rsize=wsize=1MB, is there anything in RPC besides  
TCP that restricts how much data is sent over (or received at the  
server) initially? That is, does a client start by sending a smaller  
amount, then increase up to the 1 MB limit? Or, does it simply try to  
write() 1 MB? Or does the server read a smaller amount and then  
subsequently larger amounts?

Thanks,

Scott