Re: [RFC PATCH 4/5] SUNRPC: support resvport and reuseport for rpcrdma

On 10/23/2024 11:10 PM, Kinglong Mee wrote:
On 2024/10/24 12:02 AM, Tom Talpey wrote:
On 10/19/2024 2:51 PM, Chuck Lever III wrote:
On Oct 19, 2024, at 6:21 AM, Kinglong Mee <kinglongmee@xxxxxxxxx> wrote:
Hi Chuck,
On 2024/10/18 10:23 PM, Chuck Lever wrote:
On Fri, Oct 18, 2024 at 05:23:29PM +0800, Kinglong Mee wrote:
Hi Chuck,
On 2024/10/17 9:24 PM, Chuck Lever wrote:
On Thu, Oct 17, 2024 at 10:52:19AM +0800, Kinglong Mee wrote:
Hi Chuck,
On 2024/10/16 9:57 PM, Chuck Lever wrote:
On Wed, Oct 16, 2024 at 07:48:25PM +0800, Kinglong Mee wrote:
NFSD's DRC key contains the peer port, but rpcrdma does not support port reuse now.
This patch adds resvport and reuseport support to rpcrdma, as tcp/udp already have.

An RDMA consumer is not in control of the "source port" in an RDMA
connection, thus the port number is meaningless. This is why the
Linux NFS client does not already support setting the source port on
RDMA mounts, and why NFSD sets the source port value to zero on
incoming connections; the DRC then always sees a zero port value in
its lookup key tuple.

See net/sunrpc/xprtrdma/svc_rdma_transport.c :: handle_connect_req() :

259         /* The remote port is arbitrary and not under the control of the
260          * client ULP. Set it to a fixed value so that the DRC continues
261          * to be effective after a reconnect.
262          */
263         rpc_set_port((struct sockaddr *)&newxprt->sc_xprt.xpt_remote, 0);


As a general comment, your patch descriptions need to explain /why/
each change is being made. For example, the initial patches in this
series, although they properly split the changes into clean easily
digestible hunks, are somewhat baffling until the reader gets to
this one. This patch jumps right to announcing a solution.

Thanks for your comment.


There's no cover letter tying these changes together with a problem
statement. What problematic behavior did you see that motivated this
approach?

We have a private NFS server whose DRC checks the src port, but rpcrdma does not
support reuseport now, so we tried to add it, as tcp/udp have.

Thank you for clarifying!

It's common for a DRC to key on the source port. Unfortunately,
IIRC, the Linux RDMA Connection Manager does not provide an API for
an RDMA consumer (such as the Linux NFS client) to set an arbitrary
source port value on the active side. rdma_bind_addr() works on the
listen side only.

rdma_bind_addr() also works on the client side, before rdma_resolve_addr().
From man rdma_bind_addr,
"
    NOTES
        Typically,  this routine is called before calling rdma_listen to bind to
        a specific port number, but it may also be called on the active side of
        a connection before calling rdma_resolve_addr to bind to a specific address.
"
And 9P uses rdma_bind_addr() to bind to a privileged port at p9_rdma_bind_privport().

Librdmacm supports rdma_get_local_addr(), rdma_get_peer_addr(), rdma_get_src_port() and
rdma_get_dst_port() in userspace.
So if needed, it would be easy to support them in the Linux kernel.

Rpcrdma and nvme use src_addr/dst_addr directly, as
(struct sockaddr *)&cma_xprt->sc_cm_id->route.addr.src_addr or
(struct sockaddr *)&queue->cm_id->route.addr.dst_addr.
Calling helpers may be better than direct access.

I think there is a key question for rpcrdma.
Does the port in an rpcrdma connection have the same meaning as a tcp/udp port?

My recollection is they aren't the same. I can check into the
semantic equivalency a little more.

However, on RDMA connections, NFSD ignores the source port value for
the purposes of evaluating the "secure" export option. Solaris, the
other reference implementation of RPC/RDMA, does not even bother
with this check on any transport.

Reusing the source port is very fraught (ie, it has been the source
of many TCP bugs) and privileged ports offer little real security.
I'd rather not add this behavior for RDMA if we can get away with
it. It's an antique.

If it is, we need to support reuseport.

I'm not following you. If an NFS server ignores the source port
value for the DRC and the secure port setting, why does the client
need to reuse the port value when reconnecting?

Yes, that's unnecessary.


Or, are you saying that you are not able to alter the behavior of
your private NFS server implementation to ignore the client's source
port?

No, if the rdma port at client side is indeed meaningless,
I will modify our private NFS server.

I just want to know the meaning of the port in an rdma connection.
When mounting an NFS export with rpcrdma, it connects to an IP with port 20049 implicitly;
if the port on the client side is meaningless, why is the server side's meaningful?

As far as I know, rdma_cm designed those APIs based on sockets,
so initially I thought the rdma port was the same as a tcp/udp port.

IIRC the 20049 port is needed for NFS/RDMA on iWARP. On
IB and RoCE, it's actually unnecessary.

Right, the _destination_ port 20049 is IANA-registered because iWARP
uses TCP and the server is already listening for NFS/TCP on port 2049.
The NFSv4.1 spec allows dynamic negotiation over a single port, but
to my knowledge no client or server implements that. And of course,
NFSv3 and v4.0 are naive so they require an rdma-specific port.

IB and RoCE use the "port" number to create a service ID and have no
conflict, so while it's technically unnecessary they have to listen
on something, and the server defaults to 20049 anyway for consistency.

OTOH the _source_ port is a relic of the NFS/UDP days, where the
UDP socket was kept over the life of the mountpoint and its source
port never changed. The NFS DRC took advantage of this, in, like, 1989.

For TCP, SO_REUSEPORT was used to retain the semantic, but it's not
guaranteed to work (the port could be claimed by another socket, and
even if not, the network might still deliver old packets and cause
it to fail). Because the DRC is attempting to provide correctness,
it's IMO very unwise for a new server implementation to require
the source port to match.

So, even if rdmacm now supports it, I don't think it's advisable to
implement port reuse for NFS/RDMA.

Nfscache uses the following data for duplicate request matching,
         struct {
                 /* Keep often-read xid, csum in the same cache line: */
                 __be32                  k_xid;
                 __wsum                  k_csum;
                 u32                     k_proc;
                 u32                     k_prot;
                 u32                     k_vers;
                 unsigned int            k_len;
                 struct sockaddr_in6     k_addr;
         } c_key;

Nfs-ganesha uses the address (containing the port), xid, and a 256-byte checksum
(libntirpc's rdma sets it to zero now) for duplicate request matching.

On the NFS client, an xprt connection generates the xid value for an RPC request based on
its own random number (xprt->xid = get_random_u32(); and xprt->xid++;).

The problem with the traditional DRC is that it's completely unspecified
and has been implemented in numerous different ways. And it's frequently
implemented with an LRU eviction policy, which is basically the opposite
of what's needed.

Then on top of that nobody really tests it. The "nfsidem" test that I
wrote for Connectathon performs a series of non-idempotent tests, but
it requires error injection to cause connection failures while it's
running.

I'd encourage you to try that. I predict you'll find many, many issues
in pre-NFSv4.1 servers.

https://git.linux-nfs.org/?p=steved/cthon04.git;a=summary

https://git.linux-nfs.org/?p=steved/cthon04.git;a=blob;f=special/nfsidem.c

BTW, thanks for mentioning nconnect. That's another good reason for not
looking at the source port - in addition to the source address!

Tom.


Like that, an NFS mount with a single xprt connection can run correctly;
two requests from the single xprt never contain a conflicting xid.
But with more xprts (nconnect > 1), two requests from different xprts may
contain the same xid from the same address (different ports).

If port reuse is not guaranteed to work, we should remove the use of the port from
the server's DRC (as svcrdma does by setting it to 0, or via a new implementation).

Like nfs-ganesha or some private DRCs, they implement a DRC pool for
each client based on its address, and match the other keys (xid, checksum).
Different xprts (same address, different ports) share a DRC pool;
unluckily, they check only the xid and a zero checksum.

If we agree that the NFS client's port is meaningless,
and remove it from the server's DRC, xid generation should be changed
to be per client, not per xprt.

Thanks,
Kinglong Mee


Tom.

I'll continue to sniff around and see if there's more to
find out.


Thanks,
Kinglong Mee



But perhaps my recollection is stale.


In case someone needs the src port at rpcrdma connect time, I made those patches.

With knfsd and the Linux NFS client, I haven't hit any problems.

Thanks,
Kinglong Mee


Signed-off-by: Kinglong Mee <kinglongmee@xxxxxxxxx>
---
net/sunrpc/xprtrdma/transport.c |  36 ++++++++++++
net/sunrpc/xprtrdma/verbs.c     | 100 ++++++++++++++++++++++++++++++++
net/sunrpc/xprtrdma/xprt_rdma.h |   5 ++
3 files changed, 141 insertions(+)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index 9a8ce5df83ca..fee3b562932b 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -70,6 +70,10 @@ unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRWR;
int xprt_rdma_pad_optimize;
static struct xprt_class xprt_rdma;
+static unsigned int xprt_rdma_min_resvport_limit = RPC_MIN_RESVPORT;
+static unsigned int xprt_rdma_max_resvport_limit = RPC_MAX_RESVPORT;
+unsigned int xprt_rdma_min_resvport = RPC_DEF_MIN_RESVPORT;
+unsigned int xprt_rdma_max_resvport = RPC_DEF_MAX_RESVPORT;

#if IS_ENABLED(CONFIG_SUNRPC_DEBUG)

@@ -137,6 +141,24 @@ static struct ctl_table xr_tunables_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "rdma_min_resvport",
+ .data = &xprt_rdma_min_resvport,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &xprt_rdma_min_resvport_limit,
+ .extra2 = &xprt_rdma_max_resvport_limit
+ },
+ {
+ .procname = "rdma_max_resvport",
+ .data = &xprt_rdma_max_resvport,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &xprt_rdma_min_resvport_limit,
+ .extra2 = &xprt_rdma_max_resvport_limit
+ },
};

#endif
@@ -346,6 +368,20 @@ xprt_setup_rdma(struct xprt_create *args)
xprt_rdma_format_addresses(xprt, sap);

new_xprt = rpcx_to_rdmax(xprt);
+
+ if (args->srcaddr)
+ memcpy(&new_xprt->rx_srcaddr, args->srcaddr, args->addrlen);
+ else {
+ rc = rpc_init_anyaddr(args->dstaddr->sa_family,
+ (struct sockaddr *)&new_xprt->rx_srcaddr);
+ if (rc != 0) {
+ xprt_rdma_free_addresses(xprt);
+ xprt_free(xprt);
+ module_put(THIS_MODULE);
+ return ERR_PTR(rc);
+ }
+ }
+
rc = rpcrdma_buffer_create(new_xprt);
if (rc) {
xprt_rdma_free_addresses(xprt);
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index 63262ef0c2e3..0ce5123d799b 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -285,6 +285,98 @@ static void rpcrdma_ep_removal_done(struct rpcrdma_notification *rn)
xprt_force_disconnect(ep->re_xprt);
}

+static int rpcrdma_get_random_port(void)
+{
+ unsigned short min = xprt_rdma_min_resvport, max = xprt_rdma_max_resvport;
+ unsigned short range;
+ unsigned short rand;
+
+ if (max < min)
+ return -EADDRINUSE;
+ range = max - min + 1;
+ rand = get_random_u32_below(range);
+ return rand + min;
+}
+
+static void rpcrdma_set_srcport(struct rpcrdma_xprt *r_xprt, struct rdma_cm_id *id)
+{
+ struct sockaddr *sap = (struct sockaddr *)&id->route.addr.src_addr;
+
+ if (r_xprt->rx_srcport == 0 && r_xprt->rx_xprt.reuseport) {
+ switch (sap->sa_family) {
+ case AF_INET6:
+ r_xprt->rx_srcport = ntohs(((struct sockaddr_in6 *)sap)->sin6_port);
+ break;
+ case AF_INET:
+ r_xprt->rx_srcport = ntohs(((struct sockaddr_in *)sap)->sin_port);
+ break;
+ }
+ }
+}
+
+static int rpcrdma_get_srcport(struct rpcrdma_xprt *r_xprt)
+{
+ int port = r_xprt->rx_srcport;
+
+ if (port == 0 && r_xprt->rx_xprt.resvport)
+ port = rpcrdma_get_random_port();
+ return port;
+}
+
+static unsigned short rpcrdma_next_srcport(struct rpcrdma_xprt *r_xprt, unsigned short port)
+{
+ if (r_xprt->rx_srcport != 0)
+ r_xprt->rx_srcport = 0;
+ if (!r_xprt->rx_xprt.resvport)
+ return 0;
+ if (port <= xprt_rdma_min_resvport || port > xprt_rdma_max_resvport)
+ return xprt_rdma_max_resvport;
+ return --port;
+}
+
+static int rpcrdma_bind(struct rpcrdma_xprt *r_xprt, struct rdma_cm_id *id)
+{
+ struct sockaddr_storage myaddr;
+ int err, nloop = 0;
+ int port = rpcrdma_get_srcport(r_xprt);
+ unsigned short last;
+
+ /*
+  * If we are asking for any ephemeral port (i.e. port == 0 &&
+  * r_xprt->rx_xprt.resvport == 0), don't bind.  Let the local
+  * port selection happen implicitly when the socket is used
+  * (for example at connect time).
+  *
+  * This ensures that we can continue to establish TCP
+  * connections even when all local ephemeral ports are already
+  * a part of some TCP connection.  This makes no difference
+  * for UDP sockets, but also doesn't harm them.
+  *
+  * If we're asking for any reserved port (i.e. port == 0 &&
+  * r_xprt->rx_xprt.resvport == 1) rpcrdma_get_srcport above will
+  * ensure that port is non-zero and we will bind as needed.
+  */
+ if (port <= 0)
+ return port;
+
+ memcpy(&myaddr, &r_xprt->rx_srcaddr, r_xprt->rx_xprt.addrlen);
+ do {
+ rpc_set_port((struct sockaddr *)&myaddr, port);
+ err = rdma_bind_addr(id, (struct sockaddr *)&myaddr);
+ if (err == 0) {
+ if (r_xprt->rx_xprt.reuseport)
+ r_xprt->rx_srcport = port;
+ break;
+ }
+ last = port;
+ port = rpcrdma_next_srcport(r_xprt, port);
+ if (port > last)
+ nloop++;
+ } while (err == -EADDRINUSE && nloop != 2);
+
+ return err;
+}
+
static struct rdma_cm_id *rpcrdma_create_id(struct rpcrdma_xprt *r_xprt,
      struct rpcrdma_ep *ep)
{
@@ -300,6 +392,12 @@ static struct rdma_cm_id *rpcrdma_create_id(struct rpcrdma_xprt *r_xprt,
if (IS_ERR(id))
return id;

+ rc = rpcrdma_bind(r_xprt, id);
+ if (rc) {
+ rc = -ENOTCONN;
+ goto out;
+ }
+
ep->re_async_rc = -ETIMEDOUT;
rc = rdma_resolve_addr(id, NULL, (struct sockaddr *)&xprt->addr,
         RDMA_RESOLVE_TIMEOUT);
@@ -328,6 +426,8 @@ static struct rdma_cm_id *rpcrdma_create_id(struct rpcrdma_xprt *r_xprt,
if (rc)
goto out;

+ rpcrdma_set_srcport(r_xprt, id);
+
return id;

out:
diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
index 8147d2b41494..9c7bcb541267 100644
--- a/net/sunrpc/xprtrdma/xprt_rdma.h
+++ b/net/sunrpc/xprtrdma/xprt_rdma.h
@@ -433,6 +433,9 @@ struct rpcrdma_xprt {
struct delayed_work rx_connect_worker;
struct rpc_timeout rx_timeout;
struct rpcrdma_stats rx_stats;
+
+ struct sockaddr_storage rx_srcaddr;
+ unsigned short rx_srcport;
};

#define rpcx_to_rdmax(x) container_of(x, struct rpcrdma_xprt, rx_xprt)
@@ -581,6 +584,8 @@ static inline void rpcrdma_set_xdrlen(struct xdr_buf *xdr, size_t len)
   */
extern unsigned int xprt_rdma_max_inline_read;
extern unsigned int xprt_rdma_max_inline_write;
+extern unsigned int xprt_rdma_min_resvport;
+extern unsigned int xprt_rdma_max_resvport;
void xprt_rdma_format_addresses(struct rpc_xprt *xprt, struct sockaddr *sap);
void xprt_rdma_free_addresses(struct rpc_xprt *xprt);
void xprt_rdma_close(struct rpc_xprt *xprt);
--
2.47.0


--
Chuck Lever
