Reviving an old thread....
Hi Tom Talpey and Tom Tucker, it was good to meet you at SC08. :-)
On Sep 30, 2008, at 8:34 AM, Talpey, Thomas wrote:
I believe that I am duplicating the RPCRDMA usage of credits. I need
to check.
If you are passing credits, then be sure you're managing them correctly and that you have at least as many client RPC slots configured as the server can optimally handle. It's very important to throughput. The NFSv4.1 protocol will manage these explicitly, btw - the session "slot table" is basically the same interaction, and the RPC/RDMA credits will be managed by it, so the whole stack will benefit.
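For reference, the credit value rides in the header of every RPC-over-RDMA message, so each peer re-advertises how many receives it has provisioned on every exchange. Roughly, the fixed part of the wire header looks like the sketch below (field names as in the RPC/RDMA spec; this is an illustration, not the xprtrdma source):

#include <stdint.h>

/* Sketch of the fixed portion of the RPC-over-RDMA header.  On the
 * wire these are XDR-encoded (big-endian) 32-bit words; the chunk
 * lists follow this fixed portion. */
struct rpcrdma_hdr_sketch {
	uint32_t rdma_xid;    /* mirrors the RPC transaction ID           */
	uint32_t rdma_vers;   /* RPC/RDMA protocol version (1)            */
	uint32_t rdma_credit; /* credits granted: how many receives the   */
	                      /* sender has provisioned for the peer      */
	uint32_t rdma_proc;   /* RDMA_MSG, RDMA_NOMSG, RDMA_ERROR, ...    */
};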
A quick glance seems to indicate that I am using credits. The client
never sends more than 32 requests in my tests.
RPCRDMA credits are primarily used for this; it's not so much the fact that there's a queue pair, it's really the number of posted receives. If the client sends more than the server has available, then the connection will fail. However, the server can implement something called a "shared receive queue" which permits a sort of oversubscription.
MX's behavior is more like the shared receive queue. Unexpected messages <= 32 KB are stored in a temp buffer until the matching receive has been posted. Once it is posted, the data is copied to the receive buffers and the app can complete the request by testing (polling) or waiting (blocking).
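For readers unfamiliar with MX, here is a minimal, library-agnostic sketch of that unexpected-message pattern (all names are illustrative; this is not my code or the MX API):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* A small message that arrives before its receive is posted gets
 * parked in a bounded staging pool, then copied out when the matching
 * receive finally shows up.  The 32 KB and 2 MB limits mirror the
 * figures mentioned in this thread. */
#define UNEXP_MAX_MSG  (32 * 1024)
#define UNEXP_POOL_MAX (2 * 1024 * 1024)

struct unexp_msg {
	struct unexp_msg *next;
	uint64_t match;                   /* tag the receive matches on */
	size_t len;
	char data[];
};

static struct unexp_msg *unexp_head;
static size_t unexp_bytes;

/* Arrival path: no matching receive was posted, so stage a copy. */
static int unexp_stash(uint64_t match, const void *buf, size_t len)
{
	struct unexp_msg *m;

	if (len > UNEXP_MAX_MSG || unexp_bytes + len > UNEXP_POOL_MAX)
		return -1;                /* over budget: drop or NAK */
	m = malloc(sizeof(*m) + len);
	if (!m)
		return -1;
	m->match = match;
	m->len = len;
	memcpy(m->data, buf, len);
	m->next = unexp_head;
	unexp_head = m;
	unexp_bytes += len;
	return 0;
}

/* Receive-post path: satisfy the receive from the pool if possible. */
static ssize_t unexp_claim(uint64_t match, void *buf, size_t buflen)
{
	struct unexp_msg **pp, *m;

	for (pp = &unexp_head; (m = *pp) != NULL; pp = &m->next) {
		if (m->match != match)
			continue;
		if (m->len > buflen)
			return -1;        /* receive buffer too small */
		size_t n = m->len;
		memcpy(buf, m->data, n);
		*pp = m->next;
		unexp_bytes -= n;
		free(m);
		return (ssize_t)n;
	}
	return 0;                         /* nothing staged; post for real */
}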
Ouch. I guess that's convenient for the upper layer, but it costs quite a bit of NIC memory, or if host memory is used, makes latency and bus traffic quite indeterminate. I would strongly suggest fully provisioning each server endpoint, and using the protocol's credits to manage resources.
Host memory. In the kernel, we limit the unexpected queue to 2 MB. Ideally, the only unexpected messages are RPC requests, and I have already allocated 32 per client.
I chose not to pre-post the receives for the client's request messages since they could overwhelm the MX posted receive list. By using the unexpected handler, only bulk IO receives are pre-posted (i.e., after the request has come in).
The client never posts more than the max_inline_write size, which is fully configurable. By default, it's only 1KB, and there are normally just 32 credits. Bulk data is handled by RDMA, which can be scheduled at the server's convenience - this is a key design point of the RPC/RDMA protocol. Only 32KB per client is "overwhelm" territory?
I upped my inline size to 3072 bytes (each context gets a full page,
but I can't use all of it since the header needs to go in there).
32 KB is not overwhelm territory. Posting 32 identical, small recvs for RPC request messages per client (e.g. 1000 clients) would mean that to match a single, large IO, MX would have to walk a linked list of potentially 32,000 small messages before finding the correct large message. Using the unexpected handler to manage RPC requests in an active-message manner keeps the posted recv linked list populated only with large IO messages.

I could instead have RPC requests and IO messages on separate completion queues, which would do the same thing. I use the former out of habit.
Yes, and dedicating that much memory to clients is another. With the IB and iWARP protocols and the current Linux server, these buffers are not shared. This enhances integrity and protection, but it limits the maximum scaling. I take it this is not a concern for you?
I am not sure what you mean by integrity and protection. A buffer is only used by one request at a time.
Correct - and that's precisely the goal. The issue is whether there are data paths which can expose the buffer(s) outside of the scope of a single request, for example to allow a buggy server to corrupt messages which are being processed at the client, or to allow attacks on clients or servers from foreign hosts. Formerly, with IB and iWARP we had to choose between performance and protection. With the new iWARP "FRMR" facility, we (finally) have a scheme that protects well, without costing a large per-IO penalty.
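For anyone unfamiliar with FRMR: the idea is to fast-register exactly the pages of one I/O under a fresh rkey, and invalidate that rkey as soon as the I/O completes, so a peer never holds a mapping wider or longer-lived than a single request. A very rough sketch, using the verbs names from later kernels (ib_alloc_mr / ib_map_mr_sg / IB_WR_REG_MR; the API has been respelled since this thread), with error handling mostly omitted:

#include <linux/err.h>
#include <linux/scatterlist.h>
#include <rdma/ib_verbs.h>

/* Sketch of per-I/O fast registration: map only this I/O's pages,
 * post a REG_MR work request to make the rkey valid, and (not shown)
 * post IB_WR_LOCAL_INV with the same rkey once the I/O completes. */
static int frmr_register(struct ib_pd *pd, struct ib_qp *qp,
			 struct scatterlist *sg, int nents,
			 struct ib_mr **out_mr)
{
	struct ib_reg_wr reg_wr = { };
	const struct ib_send_wr *bad_wr;
	struct ib_mr *mr;
	int n;

	mr = ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, nents);
	if (IS_ERR(mr))
		return PTR_ERR(mr);

	/* Build the MR's page list from this I/O's scatterlist only. */
	n = ib_map_mr_sg(mr, sg, nents, NULL, PAGE_SIZE);
	if (n < nents) {
		ib_dereg_mr(mr);
		return -EIO;
	}

	/* The resulting rkey is meant to live for exactly one request. */
	reg_wr.wr.opcode = IB_WR_REG_MR;
	reg_wr.mr        = mr;
	reg_wr.key       = mr->rkey;
	reg_wr.access    = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;

	*out_mr = mr;
	return ib_post_send(qp, &reg_wr.wr, &bad_wr);
}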
Hmmm. When using MX over Myrinet, such an attack is not feasible. When using MX over Ethernet, it is still probably not feasible since MX traffic is not viewable within the kernel (via wireshark, etc.). If someone were to use a non-Myricom NIC to craft a bogus Myrinet-over-Ethernet frame, it would be theoretically possible.
RPC is purely a request/response mechanism, with rules for discovering endpoints and formatting requests and replies. RPCRDMA adds framing for RDMA networks, and mechanisms for managing RDMA networks such as credits and rules on when to use RDMA. Finally, the NFS/RDMA transport binding makes requirements for sending messages. Since there are several NFS protocol versions, the answer to your question depends on that.

There is no congestion control (slow start, message sizes) in the RPC protocol itself; however, many RPC implementations provide it.
I am trying to duplicate all of the above from RPCRDMA. I am curious why a client read of 256 pages with an rsize of 128 pages arrives in three transfers of 32, 128, and then 96 pages. I assume the same reason explains why client writes succeed only if the max pages is 32.
Usually, this is because the server's filesystem delivered the results in these chunks. For example, yours may have had a 128-page extent size, which the client was reading at a 96-page offset. Therefore the first read yielded the last 32 pages of the first extent, followed by a full 128 and a 96 to finish up. Or perhaps it was simply convenient for it to return them in such a way.
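To make the arithmetic concrete, here's a toy illustration; the 128-page extent size and 96-page starting offset are just the hypothetical numbers from the example above:

#include <stdio.h>

/* Toy model of extent-aligned chunking: a 256-page read that starts
 * 96 pages into 128-page extents splits into 32 + 128 + 96 pages. */
int main(void)
{
	unsigned extent = 128, offset = 96, remaining = 256;

	while (remaining) {
		unsigned chunk = extent - (offset % extent);

		if (chunk > remaining)
			chunk = remaining;
		printf("transfer of %u pages\n", chunk);
		offset += chunk;
		remaining -= chunk;
	}
	return 0;
}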
You can maybe recode the server to perform full-sized IO, but I don't recommend it. You'll be performing synchronous filesystem ops in order to avoid a few network transfers. That is, in all likelihood, a very bad trade. But I don't know your server.
I was just curious. Thanks for the clear explanation. We will behave
like the others and service what NFS hands us.
I'm not certain if your question is purely about TCP, or if it's about RDMA with TCP as an example. However, in both cases the answer is the same: it's not about the size of a message, it's about the message itself. If the client and server have agreed that a 1MB write is ok, then yes, the client may immediately send 1MB.

Tom.
Hmmm, I will try to debug the svc_process code to find the oops.
I found several bugs and I think I have fixed them. I seem to have it working correctly with 32 KB messages (reading, writing, [un]mounting, etc.). On a few reads or writes out of 1,000, I will get an NFS stale handle error. I need to track this down.

Also, when using more than 8 pages (32 KB), reads and writes complete but the data is corrupted. This is clearly a bug in my code and I am looking into it.
Scott