On Tue, 2019-06-11 at 10:51 -0400, Chuck Lever wrote:
> Hi Neil-
> 
> > On Jun 10, 2019, at 9:09 PM, NeilBrown <neilb@xxxxxxxx> wrote:
> > 
> > On Fri, May 31 2019, Chuck Lever wrote:
> > 
> > > > On May 30, 2019, at 6:56 PM, NeilBrown <neilb@xxxxxxxx> wrote:
> > > > 
> > > > On Thu, May 30 2019, Chuck Lever wrote:
> > > > 
> > > > > Hi Neil-
> > > > > 
> > > > > Thanks for chasing this a little further.
> > > > > 
> > > > > > On May 29, 2019, at 8:41 PM, NeilBrown <neilb@xxxxxxxx> wrote:
> > > > > > 
> > > > > > This patch set is based on the patches in the multipath_tcp branch of
> > > > > > git://git.linux-nfs.org/projects/trondmy/nfs-2.6.git
> > > > > > 
> > > > > > I'd like to add my voice to those supporting this work and wanting to
> > > > > > see it land.
> > > > > > We have had customers/partners wanting this sort of functionality for
> > > > > > years. In SLES releases prior to SLE15, we've provided a
> > > > > > "nosharetransport" mount option, so that several filesystems could be
> > > > > > mounted from the same server and each would get its own TCP
> > > > > > connection.
> > > > > 
> > > > > Is it well understood why splitting up the TCP connections results
> > > > > in better performance?
> > > > > 
> > > > > > In SLE15 we are using this 'nconnect' feature, which is much nicer.
> > > > > > 
> > > > > > Partners have assured us that it improves total throughput,
> > > > > > particularly with bonded networks, but we haven't had any concrete
> > > > > > data until Olga Kornievskaia provided some concrete test data -
> > > > > > thanks Olga!
> > > > > > 
> > > > > > My understanding, as I explain in one of the patches, is that
> > > > > > parallel hardware is normally utilized by distributing flows, rather
> > > > > > than packets. This avoids out-of-order delivery of packets in a flow.
> > > > > > So multiple flows are needed to utilize parallel hardware.
> > > > > 
> > > > > Indeed.
> > > > > 
> > > > > However I think one of the problems is what happens in simpler
> > > > > scenarios. We had reports that using nconnect > 1 on virtual clients
> > > > > made things go slower. It's not always wise to establish multiple
> > > > > connections between the same two IP addresses. It depends on the
> > > > > hardware on each end, and the network conditions.
> > > > 
> > > > This is a good argument for leaving the default at '1'. When
> > > > documentation is added to nfs(5), we can make it clear that the optimal
> > > > number is dependent on hardware.
> > > 
> > > Is there any visibility into the NIC hardware that can guide this
> > > setting?
> > 
> > I doubt it, partly because there is more than just the NIC hardware at
> > issue. There is also the server-side hardware and possibly hardware in
> > the middle. So the best guidance is YMMV. :-)
> > 
> > > > > What about situations where the network capabilities between server
> > > > > and client change? Problem is that neither endpoint can detect that;
> > > > > TCP usually just deals with it.
> > > > 
> > > > Being able to manually change (-o remount) the number of connections
> > > > might be useful...
> > > 
> > > Ugh. I have problems with the administrative interface for this feature,
> > > and this is one of them.
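
(For concreteness: the interface under discussion is just an ordinary
mount option. A hypothetical invocation, with placeholder server and
export names, might look like:

    mount -t nfs -o nconnect=2 server:/export /mnt/export

i.e. the client is asked to open two TCP connections to "server" and to
spread its RPC traffic across them.)
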
> > > Another is what prevents your client from using a different nconnect=
> > > setting on concurrent mounts of the same server? It's another case of a
> > > per-mount setting being used to control a resource that is shared
> > > across mounts.
> > 
> > I think that horse has well and truly bolted.
> > It would be nice to have a "server" abstraction visible to user-space
> > where we could adjust settings that make sense server-wide, and then a
> > way to mount individual filesystems from that "server" - but we don't.
> 
> Even worse, there will be some resource sharing between containers that
> might be undesirable. The host should have ultimate control over those
> resources.
> 
> But that is neither here nor there.

We can't and we don't normally share NFS resources between containers
unless they share a network namespace. IOW: containers should normally
work just fine, with each container able to control its own connections
to any given server.

> > Probably the best we can do is to document (in nfs(5)) which options
> > are per-server and which are per-mount.
> 
> Alternately, the behavior of this option could be documented this way:
> 
> The default value is one. To resolve conflicts between nconnect settings
> on different mount points to the same server, the value set on the first
> mount applies until there are no more mounts of that server, unless
> nosharecache is specified. When following a referral to another server,
> the nconnect setting is inherited, but the effective value is determined
> by other mounts of that server that are already in place.
> 
> I hate to say it, but the way to make this work deterministically is to
> ask administrators to ensure that the setting is the same on all mounts
> of the same server. Again I'd rather this take care of itself, but it
> appears that is not going to be possible.
> 
> > > Adding user tunables has never been known to increase the aggregate
> > > amount of happiness in the universe. I really hope we can come up with
> > > a better administrative interface... ideally, none would be best.
> > 
> > I agree that none would be best. It isn't clear to me that that is
> > possible.
> > At present, we really don't have enough experience with this
> > functionality to be able to say what the trade-offs are.
> > If we delay the functionality until we have the perfect interface,
> > we may never get that experience.
> > 
> > We can document "nconnect=" as a hint, and possibly add that
> > "nconnect=1" is a firm guarantee that more will not be used.
> 
> Agree that 1 should be the default. If we make this setting a hint, then
> perhaps it should be renamed; nconnect makes it sound like the client
> will always open N connections. How about "maxconn"?
> 
> Then, to better define the behavior:
> 
> The range of valid maxconn values is 1 to 3? to 8? to NCPUS? to the
> count of the client's NUMA nodes? I'd be in favor of a small number to
> start with. Solaris' experience with multiple connections is that there
> is very little benefit past 8.
> 
> If maxconn is specified with a datagram transport, does the mount
> operation fail, or is the setting ignored?

It is ignored.

> If maxconn is a hint, when does the client open additional connections?

As I've already stated, that functionality is not yet available. When it
is, it will be under the control of a userspace daemon that can decide on
a policy in accordance with a set of user-specified requirements.
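
(To illustrate the conflict-resolution rule proposed above, under the
"first mount wins" behaviour Chuck describes and with hypothetical export
names:

    mount -t nfs -o nconnect=4 server:/export/a /mnt/a
    mount -t nfs -o nconnect=2 server:/export/b /mnt/b

the second mount reuses the transports already established for "server",
so its nconnect=2 has no effect; four connections remain in use until the
last mount of that server goes away.)
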
> IMO documentation should be clear that this setting is not for the
> purpose of multipathing/trunking (using multiple NICs on the client or
> server). The client has to do trunking detection/discovery in that case,
> and nconnect doesn't add that logic. This is strictly for enabling
> multiple connections between one client-server IP address pair.
> 
> Do we need to state explicitly that all transport connections for a
> mount (or client-server pair) are the same connection type (i.e., all
> TCP or all RDMA, never a mix)?
> 
> > Then further down the track, we might change the actual number of
> > connections automatically if a way can be found to do that without
> > cost.
> 
> Fair enough.
> 
> > Do you have any objections apart from the nconnect= mount option?
> 
> Well I realize my last e-mail sounded a little negative, but I'm
> actually in favor of adding the ability to open multiple connections
> per client-server pair. I just want to be careful about making this
> a feature that has as few downsides as possible right from the start.
> I'll try to be more helpful in my responses.
> 
> Remaining implementation issues that IMO need to be sorted:
> 
> • We want to take care that the client can recover network resources
> that have gone idle. Can we reuse the auto-close logic to close extra
> connections?
> • How will the client schedule requests on multiple connections?
> Should we enable the use of different schedulers?
> • How will retransmits be handled?
> • How will the client recover from broken connections? Today's clients
> use disconnect to determine when to retransmit, thus there might be
> some unwanted interactions here that result in mount hangs.
> • Assume NFSv4.1 session ID rather than client ID trunking: is Linux
> client support in place for this already?
> • Are there any concerns about how the Linux server DRC will behave in
> multi-connection scenarios?

Round and round the argument goes... Please see the earlier answers to
all these questions.

> None of these seem like a deal breaker. And possibly several of these
> are already decided, but just need to be published/documented.
> 
> 
> --
> Chuck Lever
> 

-- 
Trond Myklebust
CTO, Hammerspace Inc
4300 El Camino Real, Suite 105
Los Altos, CA 94022
www.hammer.space