Re: Congestion window or other reason?

Reviving an old thread....

Hi Tom Talpey and Tom Tucker, it was good to meet you at SC08. :-)

On Sep 30, 2008, at 8:34 AM, Talpey, Thomas wrote:

I believe that I am duplicating the RPCRDMA usage of credits. I need
to check.

If you are passing credits, then be sure you're managing them correctly and that you have at least as many client RPC slots configured as the server can optimally handle. It's very important to throughput. The NFSv4.1 protocol will manage these explicitly, btw - the session "slot table" is basically the same interaction. The RPC/RDMA credits will be managed by them,
so the whole stack will benefit.
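The credit interaction described above can be sketched as a small piece of client-side accounting: the client may only have as many requests in flight as the server has granted, and each reply carries a fresh grant. This is an illustrative sketch; the names (xprt_credits, etc.) are made up, not taken from the Linux RPC/RDMA sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical RPC/RDMA-style credit accounting: requests are gated
 * on the server's last advertised credit grant, and every reply may
 * carry a new grant that grows or shrinks the window. */

struct xprt_credits {
    uint32_t granted;   /* credits the server last advertised */
    uint32_t in_flight; /* requests sent but not yet answered */
};

/* Returns 1 if a new request may be sent, 0 if the client must wait. */
static int credit_try_send(struct xprt_credits *c)
{
    if (c->in_flight >= c->granted)
        return 0;           /* no slot free: queue the request */
    c->in_flight++;
    return 1;
}

/* Called on each reply; the server piggybacks its new grant. */
static void credit_on_reply(struct xprt_credits *c, uint32_t new_grant)
{
    assert(c->in_flight > 0);
    c->in_flight--;
    if (new_grant > 0)      /* server may resize the window per reply */
        c->granted = new_grant;
}
```

The NFSv4.1 session slot table behaves the same way at the RPC layer: a bounded window of outstanding requests, resized by the server in its replies.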

A quick glance seems to indicate that I am using credits. The client never sends more than 32 requests in my tests.

RPCRDMA credits are primarily used for this; it's not so much the
fact that there's a queue pair, it's actually the number of posted receives. If the client sends more than the server has available, the connection will fail. However, the server can implement something called a "shared receive queue", which permits a sort of oversubscription.

MX's behavior is more like the shared receive queue. Unexpected
messages <=32KB are stored in a temp buffer until the matching receive
has been posted. Once it is posted, the data is copied to the receive
buffers and the app can complete the request by testing (polling) or
waiting (blocking).
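The MX behavior described above can be modeled as a bounded unexpected-message pool: messages no larger than 32 KB that arrive before their receive is posted are copied into host memory until matched. This is a pure-logic sketch with invented names, not the real MX API, using an illustrative 2 MB cap (the limit Scott cites for the in-kernel unexpected queue).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of MX-style unexpected-message buffering: small messages
 * land in a bounded temp pool until the matching receive is posted;
 * oversized or overflow messages are refused. */

#define UNEXP_LIMIT   (2u * 1024 * 1024)  /* 2 MB in-kernel cap */
#define UNEXP_MAX_MSG (32u * 1024)        /* only <= 32 KB buffered */

struct unexp_pool {
    uint8_t buf[UNEXP_LIMIT];
    size_t  used;
};

/* Returns bytes accepted; 0 means the message could not be buffered. */
static size_t unexp_store(struct unexp_pool *p, const void *msg, size_t len)
{
    if (len > UNEXP_MAX_MSG || p->used + len > UNEXP_LIMIT)
        return 0;                 /* too big, or pool exhausted */
    memcpy(p->buf + p->used, msg, len);
    p->used += len;
    return len;
}
```

When the receive is finally posted, the buffered bytes are copied out and the request completes by polling or blocking, as described above.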

Ouch. I guess that's convenient for the upper layer, but it costs quite a
bit of NIC memory, or if host memory is used, makes latency and bus
traffic quite indeterminate. I would strongly suggest fully provisioning each server endpoint, and using the protocol's credits to manage resources.

Host memory. In the kernel, we limit the unexpected queue to 2 MB. Ideally, the only unexpected messages are RPC requests, and I have already allocated 32 per client.

I chose not to pre-post the receives for the client's request messages
since they could overwhelm the MX posted receive list. By using the
unexpected handler, only bulk IO are pre-posted (i.e. after the
request has come in).

The client never posts more than the max_inline_write size, which is
fully configurable. By default, it's only 1KB, and there are normally just
32 credits. Bulk data is handled by RDMA, which can be scheduled at
the server's convenience - this is a key design point of the RPC/RDMA
protocol. Only 32KB per client is "overwhelm" territory?
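The send-path decision Tom describes reduces to a simple threshold: payloads within max_inline_write travel inline in the RPC call message, while anything larger is advertised as an RDMA chunk that the server transfers at its own convenience. A minimal sketch, with illustrative names and the defaults quoted above (1 KB inline, 32 credits):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical send-path selection for an RPC/RDMA client: inline
 * data rides in the call message; larger payloads become RDMA
 * chunks the server reads or writes when it chooses. */

enum send_path { SEND_INLINE, SEND_RDMA_CHUNK };

static enum send_path choose_send_path(size_t payload,
                                       size_t max_inline_write)
{
    return payload <= max_inline_write ? SEND_INLINE : SEND_RDMA_CHUNK;
}
```

With the 1 KB default, at most max_inline_write times 32 credits (32 KB) of inline data can be outstanding per client, which is the figure Tom questions as "overwhelm" territory.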

I upped my inline size to 3072 bytes (each context gets a full page, but I can't use all of it since the header needs to go in there).

32 KB is not overwhelm territory. Posting 32 identical, small recvs for RPC request messages per client (e.g. 1000 clients) would mean that to match a single, large IO, MX would have to walk a linked-list with potentially 32,000 small messages before finding the correct large message. Using the unexpected handler to manage RPC requests in an active message manner keeps the posted recv linked-list populated only with large IO messages.
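The matching cost above is simple arithmetic: with every client pre-posting its small request receives, a large-IO match may have to walk past all of them. A back-of-the-envelope check, using the numbers from the text:

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case posted-receive list walk if small request recvs were
 * pre-posted: every small entry may sit ahead of the large-IO match. */
static uint64_t worst_case_walk(uint64_t clients, uint64_t recvs_per_client)
{
    return clients * recvs_per_client;
}
```

Keeping RPC requests in the unexpected handler means the posted-receive list holds only large-IO entries, so the walk stays short regardless of client count.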

I could instead have RPC requests and IO messages on separate completion queues which would do the same thing. I use the former out of habit.

Yes, and dedicating that much memory to clients is another concern. With the
IB and iWARP protocols and the current Linux server, these buffers are not shared. This enhances integrity and protection, but it limits the
maximum scaling. I take it this is not a concern for you?

I am not sure about what you mean by integrity and protection. A
buffer is only used by one request at a time.

Correct - and that's precisely the goal. The issue is whether there are
data paths which can expose the buffer(s) outside of the scope of a
single request, for example to allow a buggy server to corrupt messages which are being processed at the client, or to allow attacks on clients or servers from foreign hosts. Formerly, with IB and iWARP we had to choose between performance and protection. With the new iWARP "FRMR" facility,
we (finally) have a scheme that protects well, without costing a large
per-io penalty.

Hmmm. When using MX over Myrinet, such an attack is not feasible. When using MX over Ethernet, it is still probably not feasible, since MX traffic is not viewable within the kernel (via wireshark, etc.). If someone used a non-Myricom NIC to craft a bogus Myrinet-over-Ethernet frame, it would be theoretically possible.

RPC is purely a request/response mechanism, with rules for discovering
endpoints and formatting requests and replies. RPCRDMA adds framing
for RDMA networks, and mechanisms for managing RDMA networks such
as credits and rules on when to use RDMA. Finally, the NFS/RDMA
transport binding makes requirements for sending messages. Since there are several NFS protocol versions, the answer to your question depends on that. There is no congestion control (slow start, message sizes) in the RPC
protocol, however there are many implementations of it in RPC.

I am trying to duplicate all of the above from RPCRDMA. I am curious
why a client read of 256 pages with a rsize of 128 pages arrives in
three transfers of 32, 128, and then 96 pages. I assume that the same
reason is allowing client writes to succeed only if the max pages is 32.

Usually, this is because the server's filesystem delivered the results in
these chunks. For example, yours may have had a 128-page extent size,
which the client was reading on a 96-page offset. Therefore the first read yielded the last 32 pages of the first extent, followed by a full 128 and a 96 to finish up. Or perhaps, it was simply convenient for it to return
them in such a way.
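The extent explanation above can be made concrete: a 256-page read starting 96 pages into a filesystem with 128-page extents first reads to the extent boundary (32 pages), then a full extent (128), then the remainder (96). A small illustrative helper, assuming the hypothesized extent layout:

```c
#include <assert.h>
#include <stddef.h>

/* Split a read of `total` pages into extent-bounded transfers,
 * starting `offset_in_extent` pages into the first extent.
 * Fills `out` with per-transfer page counts; returns the count. */
static size_t split_by_extent(size_t total, size_t extent,
                              size_t offset_in_extent,
                              size_t *out, size_t max_out)
{
    size_t n = 0;
    while (total > 0 && n < max_out) {
        size_t chunk = extent - offset_in_extent; /* to extent boundary */
        if (chunk > total)
            chunk = total;
        out[n++] = chunk;
        total -= chunk;
        offset_in_extent = 0; /* later transfers are extent-aligned */
    }
    return n;
}
```

With total = 256, extent = 128, and offset = 96, this yields the 32, 128, 96 pattern Scott observed.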

You can maybe recode the server to perform full-sized IO, but I don't
recommend it. You'll be performing synchronous filesystem ops in order
to avoid a few network transfers. That is, in all likelihood, a very bad
trade. But I don't know your server.

I was just curious. Thanks for the clear explanation. We will behave like the others and service what NFS hands us.

I'm not certain if your question is purely about TCP, or if it's
about RDMA with TCP as an example. However, in both cases the answer is the same:
it's not about the size of a message, it's about the message itself.
If the client and server have agreed that a 1MB write is ok, then yes the
client may immediately send 1MB.

Tom.

Hmmm, I will try to debug the svc_process code to find the oops.

I found several bugs and I think I have fixed them. I seem to have it working correctly with 32 KB messages (reading, writing, [un]mounting, etc.). On a few reads or writes out of 1,000, I will get an NFS stale handle error. I need to track this down.

Also, when using more than 8 pages (32 KB), reads and writes complete but the data is corrupted. This is clearly a bug in my code and I am looking into it.

Scott
--
