Re: [RFC PATCH 0/5] Fun with the multipathing code

> On Apr 28, 2017, at 2:08 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
> 
> On Fri, 2017-04-28 at 10:45 -0700, Chuck Lever wrote:
>>> On Apr 28, 2017, at 10:25 AM, Trond Myklebust <trond.myklebust@primarydata.com> wrote:
>>> 
>>> In the spirit of experimentation, I've put together a set of
>>> patches
>>> that implement setting up multiple TCP connections to the server.
>>> The connections all go to the same server IP address, so do not
>>> provide support for multiple IP addresses (which I believe is
>>> something Andy Adamson is working on).
>>> 
>>> The feature is only enabled for NFSv4.1 and NFSv4.2 for now; I
>>> don't
>>> feel comfortable subjecting NFSv3/v4 replay caches to this
>>> treatment yet. It relies on the mount option "nconnect" to specify
>>> the number of connections to set up. So you can do something like
>>>  'mount -t nfs -overs=4.1,nconnect=8 foo:/bar /mnt'
>>> to set up 8 TCP connections to server 'foo'.
>> 
>> IMO this setting should eventually be set dynamically by the
>> client, or should be global (e.g., a module parameter).
> 
> There is an argument for making it a per-server value (which is what
> this patchset does). It allows the admin a certain control to limit the
> number of connections to specific servers that need to serve larger
> numbers of clients. However, I'm open to counter-arguments. I've no
> strong opinions yet.

Like direct I/O, this kind of setting could allow a single
client to DoS a server.

One additional concern might be how to deal with servers that
cannot accept additional connections during certain periods,
but are able to support a large number of connections at
other times.
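
For the record, the "global" alternative I mentioned above could be
as simple as a module parameter that clamps whatever a mount asks
for. This is just an untested sketch, and every name in it is
invented:

    /* Hypothetical sketch only, not part of this patch set. */
    #include <linux/kernel.h>
    #include <linux/moduleparam.h>

    /* Global upper bound on connections per server, tunable at runtime. */
    static unsigned int max_connect = 16;
    module_param(max_connect, uint, 0644);
    MODULE_PARM_DESC(max_connect,
                     "Upper bound on TCP connections per NFS server");

    /* Clamp whatever nconnect= value a mount requested. */
    static unsigned int nfs_clamp_nconnect(unsigned int nconnect)
    {
            return clamp(nconnect, 1U, max_connect);
    }

That way "nconnect" stays per-mount and advisory, but the admin has
one knob that bounds it everywhere.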


>> Since mount points to the same server share the same transport,
>> what happens if you specify a different "nconnect" setting on
>> two mount points to the same server?
> 
> Currently, the first one wins.
> 
>> What will the client do if there are not enough resources
>> (e.g., source ports) to create that many? Or is this an "up to N"
>> kind of setting? I can imagine a big client having to reduce
>> the number of connections to each server to help it scale in
>> number of server connections.
> 
> There is an arbitrary (compile time) limit of 16. The use of the
> SO_REUSEPORT socket option ensures that we should almost always be able
> to satisfy that number of source ports, since they can be shared with
> connections to other servers.
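
(For anyone who hasn't looked at how SO_REUSEPORT behaves for TCP
clients: it lets two sockets bind the same local port, and both
connects then succeed as long as the resulting 4-tuples differ. A
rough user-space illustration, not the actual sunrpc code, and the
port number below is an arbitrary example:)

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Create a TCP socket bound to the given local port with SO_REUSEPORT. */
    static int bound_socket(uint16_t port)
    {
            int one = 1;
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            struct sockaddr_in local = {
                    .sin_family = AF_INET,
                    .sin_port = htons(port),
                    .sin_addr.s_addr = htonl(INADDR_ANY),
            };

            /* Must be set before bind() for the shared bind to succeed. */
            setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
            if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0)
                    perror("bind");
            return fd;
    }

    int main(void)
    {
            int a = bound_socket(50000);    /* arbitrary example port */
            int b = bound_socket(50000);    /* same source port, bind succeeds */

            /* connect(a, ...) and connect(b, ...) to two different server
             * addresses would both succeed from here on. */
            close(a);
            close(b);
            return 0;
    }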

FWIW, Solaris limits this setting to 8. I think past that
value, there is only incremental and diminishing gain.
That could be apples to pears, though.

I'm not aware of a mount option, but there might be a
system tunable that controls this setting on each client.


>> Other storage protocols have a mechanism for determining how
>> transport connections are provisioned: One connection per
>> CPU core (or one connection per NUMA node) on the client. This gives
>> a clear way to decide which connection to use for each RPC,
>> and guarantees the reply will arrive at the same compute
>> domain that sent the call.
> 
> Can we perhaps lay out a case for which mechanisms are useful as far as
> hardware is concerned? I understand the socket code is already
> affinitised to CPU caches, so that one's easy. I'm less familiar with
> the various features of the underlying offloaded NICs and how they tend
> to react when you add/subtract TCP connections.

Well, the optimal number of connections varies depending on
the NIC hardware design. I don't think there's a hard-and-fast
rule, but typically the server-class NICs have multiple DMA
engines and multiple cores. Thus they benefit from having
multiple sockets, up to a point.

Smaller clients would have a handful of cores, a single
memory hierarchy, and one NIC. I would guess optimizing for
the NIC (or server) would be best in that case. I'd bet
two connections would be a very good default.

For large clients, a connection per NUMA node makes sense.
This keeps the amount of cross-node memory traffic to a
minimum, which improves system scalability.

The issues with "socket per CPU core" are: there can be a lot
of cores, and it might be wasteful to open that many sockets
to each NFS server; and what do you do with a socket when
a CPU core is taken offline?
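
To make the NUMA option concrete, the transport selection could be
as simple as the sketch below. It is purely illustrative (the names
are invented), and it ignores offlined CPUs and transport health
entirely:

    /* Hypothetical sketch only, not the xprtmultipath code. */
    #include <linux/sunrpc/xprt.h>
    #include <linux/topology.h>

    /*
     * Pick one of the pre-created transports based on the NUMA node
     * of the CPU submitting the RPC, so that the reply tends to land
     * in the same memory domain that issued the call.
     */
    static struct rpc_xprt *pick_xprt(struct rpc_xprt **xprts,
                                      unsigned int nr_xprts)
    {
            unsigned int node = numa_node_id();

            return xprts[node % nr_xprts];
    }

Per-node round-robin, or skipping congested transports, could layer
on top of something like that.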


>> And of course: RPC-over-RDMA really loves this kind of feature
>> (multiple connections between same IP tuples) to spread the
>> workload over multiple QPs. There isn't anything special needed
>> for RDMA, I hope, but I'll have a look at the SUNRPC pieces.
> 
> I haven't yet enabled it for RPC/RDMA, but I imagine you can help out
> if you find it useful (as you appear to do).

I can give the patch set a try this week. I haven't seen
anything that would exclude proto=rdma from playing in this
sandbox.


>> Thanks for posting, I'm looking forward to seeing this
>> capability in the Linux client.
>> 
>> 
>>> Anyhow, feel free to test and give me feedback as to whether or not
>>> this helps performance on your system.
>>> 
>>> Trond Myklebust (5):
>>>  SUNRPC: Allow creation of RPC clients with multiple connections
>>>  NFS: Add a mount option to specify number of TCP connections to use
>>>  NFSv4: Allow multiple connections to NFSv4.x (x>0) servers
>>>  pNFS: Allow multiple connections to the DS
>>>  NFS: Display the "nconnect" mount option if it is set.
>>> 
>>> fs/nfs/client.c             |  2 ++
>>> fs/nfs/internal.h           |  2 ++
>>> fs/nfs/nfs3client.c         |  3 +++
>>> fs/nfs/nfs4client.c         | 13 +++++++++++--
>>> fs/nfs/super.c              | 12 ++++++++++++
>>> include/linux/nfs_fs_sb.h   |  1 +
>>> include/linux/sunrpc/clnt.h |  1 +
>>> net/sunrpc/clnt.c           | 17 ++++++++++++++++-
>>> net/sunrpc/xprtmultipath.c  |  3 +--
>>> 9 files changed, 49 insertions(+), 5 deletions(-)
>>> 
>>> -- 
>>> 2.9.3
>>> 
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, PrimaryData
> trond.myklebust@xxxxxxxxxxxxxxx

--
Chuck Lever





