On Sep 26, 2008, at 4:08 PM, Talpey, Thomas wrote:
> I'd love to hear more about RPCMX! What is it?

It is based on the RPCRDMA code, using MX. MX is Myricom's
second-generation zero-copy, kernel-bypass API (GM was the first).
Unlike IB/iWARP, MX provides only a two-sided interface (send/recv)
and is closely modeled on MPI-1 semantics.

I wrote the MX ports for Lustre and PVFS2. I am finding this to be
more challenging than either of those.

> The congestion window is all about the number of concurrent RPC
> requests, and isn't dependent on the number of segments or even size
> of each message. Congestion is a client-side thing, the server never
> delays its replies.

Interesting. The client does not have a global view, unfortunately,
and has no idea how busy the server is (i.e. how many other clients it
is servicing).
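A minimal sketch of a window of the kind described above, one that
counts requests in flight rather than segments or bytes. This is not
the RPC/RDMA code; the class name, method names, and the window size
of 2 are hypothetical and chosen only for illustration.

```python
class RequestWindow:
    """Client-side window that counts RPC requests in flight, not bytes.

    Hypothetical illustration only; not taken from the kernel RPC code.
    """

    def __init__(self, cwnd):
        self.cwnd = cwnd      # maximum number of concurrent requests
        self.in_flight = 0    # requests sent but not yet answered

    def try_send(self, nsegments, nbytes):
        # nsegments and nbytes are deliberately ignored: every request
        # costs exactly one slot, whatever its size.
        if self.in_flight >= self.cwnd:
            return False      # window full, the caller has to wait
        self.in_flight += 1
        return True

    def on_reply(self):
        # A reply frees one slot. Nothing here reflects server load,
        # which the client cannot see.
        if self.in_flight > 0:
            self.in_flight -= 1


win = RequestWindow(cwnd=2)
results = [
    win.try_send(1, 512),        # small request: takes one slot
    win.try_send(256, 1 << 20),  # 1 MB request: also takes one slot
    win.try_send(1, 64),         # refused, two requests are in flight
]
win.on_reply()                   # one reply arrives
results.append(win.try_send(1, 64))
print(results)
```

The 1 MB request and the 512-byte request consume the same amount of
window, and the third request is held back only because two are
already outstanding.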

> The RPC/RDMA code uses the congestion window to manage its flow
> control window with the server. There is a second, somewhat hidden
> congestion window that the RDMA adapters use between one another for
> RDMA Read requests, the IRD/ORD. But those aren't visible outside the
> lowest layer.

Is this due to the fact that IB uses queue pairs (QP) and one peer
cannot send a message to another unless a slot is available in the QP?
If so, we do not have this limitation in MX (no QPs).

> I would be surprised if you can manage hundreds of pages times dozens
> of active requests without some significant resource issues at the
> server. Perhaps your problems are related to those?
>
> Tom.

In Lustre and PVFS2, the network MTU is 1 MB (and optionally 4 MB in
PVFS2). We do not have issues in MX scaling to hundreds or thousands
of peers (again, no QPs). As for handling a few hundred MB from a few
hundred clients, it should be no problem. Whether the filesystem
back end can handle it is another question.

When using TCP with rsize=wsize=1MB, is there anything in RPC besides
TCP that restricts how much data is sent over (or received at the
server) initially? That is, does a client start by sending a smaller
amount, then increase up to the 1 MB limit? Or, does it simply try to
write() 1 MB? Or does the server read a smaller amount and then
subsequently larger amounts?
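
For reference, the "simply try to write() 1 MB" alternative looks
roughly like this in user space. This is only a sketch of that one
option and says nothing about what the kernel RPC client actually
does. It uses socket.socketpair() as a stand-in for a TCP connection,
and the helper names send_all and recv_all are hypothetical.

```python
import socket
import threading

PAYLOAD = 1 << 20  # 1 MB, matching rsize=wsize=1MB


def send_all(sock, data):
    """Offer the whole buffer to send() and count the calls it takes."""
    view = memoryview(data)
    calls = 0
    while len(view) > 0:
        n = sock.send(view)  # may accept fewer bytes than offered
        view = view[n:]
        calls += 1
    return calls


def recv_all(sock, total, out):
    """Read until total bytes have arrived or the peer closes."""
    while len(out) < total:
        chunk = sock.recv(65536)
        if not chunk:
            break
        out.extend(chunk)


# socketpair() stands in for a connected TCP socket (assumption).
a, b = socket.socketpair()
received = bytearray()
reader = threading.Thread(target=recv_all, args=(b, PAYLOAD, received))
reader.start()

calls = send_all(a, bytes(PAYLOAD))  # no ramp-up at this layer
a.close()
reader.join()
b.close()

print(len(received), "bytes received after", calls, "send() call(s)")
```

Here the sender makes no attempt to start small and grow; any pacing
is left to the transport underneath.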
Thanks,
Scott