On Thu, Feb 24, 2022 at 1:20 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
>
> > On Feb 24, 2022, at 12:55 PM, Olga Kornievskaia <olga.kornievskaia@xxxxxxxxx> wrote:
> >
> > On Thu, Feb 24, 2022 at 10:30 AM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
> >>
> >>> On Feb 23, 2022, at 12:40 PM, Olga Kornievskaia <olga.kornievskaia@xxxxxxxxx> wrote:
> >>>
> >>> From: Olga Kornievskaia <kolga@xxxxxxxxxx>
> >>>
> >>> Introduce a new mount option -- trunkdiscovery,notrunkdiscovery -- to
> >>> toggle whether or not the client will engage in active discovery
> >>> of trunking locations.
> >>
> >> An alternative solution might be to change the client's
> >> probe to treat NFS4ERR_DELAY as "no trunking information
> >> available" and then allow operation to proceed on the
> >> known good transport.
> >
> > I'm not sure what you mean about "the known good transport".
>
> The transport on which the client sent the
> GETATTR(fs_locations).
>
> The NFS4ERR_DELAY response means the server has no other
> trunks available "at this time."

But GETATTR(fs_locations) isn't only used for the trunking query;
it's used for filesystem location (migration) as well. Are we
redefining what ERR_DELAY means in the context of trunking vs
migration?

> > I don't
> > think the ERR_DELAY is associated with a transport. Btw, if you saw a
> > previous patch which restricts fs_location query to the main transport
> > makes your statement even more confusing as it would mean there is no
> > good transport. Or do you mean to say we should have trunking
> > discovery done asynchronous to mount by a separate kernel thread and
> > therefore not impact mount steps?
>
> Yes, something like that.
>
> Trunking discovery that is independent of the NFS mount
> process should be the goal. In fact, trunking discovery
> really ought to be done in user space.

I agree that continuous management of trunking should be a goal, but
the initial setup is a part of file system attribute discovery.
fs_locations is a file system attribute which is queried along with
other attributes upon discovery of a file system. Thus I maintain that
the current treatment of trunking discovery is valid. What is
described below is a set of goals for trunking that we have discussed
before and that are important.

> - There is now a user/kernel API for managing transports
>
> - The trunking configuration on the server might change
> during the lifetime of the mount, so periodic checking
> is needed
>
> - Adding an extra round trip, especially one that might
> be slowed by one or more NFS4ERR_DELAY replies, is
> going to be a problem during a mount storm
>
> - There might be local policies that affect which network
> paths to choose for trunking
>
> - The choice of transports might be made automatically
> by an orchestrator
>
> - Tying this setting to a mount option is not appropriate
> because the transports are shared among multiple NFS
> mounts
>
>
> > I do object to treating a single ERR_DELAY during discovery as a
> > permanent error as there are legitimate reasons for a delay in looking
> > up the information that can be resolved in time by the server.
> > However, I don't object to putting a time limit or number of tries on
> > ERR_DELAY as safety wheels.
>
> In the past, some have objected to /any/ delay added to
> the NFS mount process.

Again, I would like to note that fs_locations is a file system
attribute and thus, I would argue, has to be treated like other file
system attributes.

> There's no reason to hold up the mount process -- the
> client can try the trunking discovery probe again in a
> few moments while the mount proceeds, can't it?

Given that I suggested doing it asynchronously, I do consider it a
possible design, though I think it greatly increases the complexity of
the system (I'm not convinced that it's the right call).

> If that means handing the probe to a work queue or
> leaving it to user space, that seems like a more
> flexible choice.
>
>
> > Lastly, I think perhaps we can do both: have a mount option to toggle
> > discovery as well as safeguard the discovery from broken servers?
>
> I'd really rather not add a mount option for this
> purpose unless you know of another reason why trunking
> discovery needs to be disabled.

I don't offhand. I thought it was the simplest and most appropriate
solution, and perhaps in line with the "migration/nomigration" option,
but I must be mistaken there.

> The best solution is to fix the server implementations.
> If that's not possible then the second best is to have
> the client manage the situation without needing any
> human intervention.
>
> Adding an administrative tunable is, to my mind, an
> option of the very last resort.
>
>
> --
> Chuck Lever
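
For illustration only, the "time limit or number of tries" safeguard
mentioned above could look roughly like the sketch below. It is plain
userspace C, not kernel code and not taken from any posted patch.
bounded_trunk_probe(), trunk_probe_fn and fake_server_probe() are
hypothetical names invented for this example; only the NFS4_OK and
NFS4ERR_DELAY status values come from the NFSv4 specifications. A real
client would also back off between tries, which is left out here.

```c
#include <stddef.h>

/* Status values as defined by the NFSv4 protocol. */
#define NFS4_OK        0
#define NFS4ERR_DELAY  10008

/* Hypothetical outcome of a bounded trunking discovery probe. */
enum trunk_probe_result {
	TRUNK_PROBE_OK,       /* server answered the fs_locations query */
	TRUNK_PROBE_GAVE_UP,  /* still NFS4ERR_DELAY after max_tries */
	TRUNK_PROBE_FAILED    /* any other error: do not retry */
};

/* Hypothetical callback: sends one probe, returns an NFSv4 status. */
typedef int (*trunk_probe_fn)(void *ctx);

/*
 * Retry the probe while the server keeps answering NFS4ERR_DELAY,
 * but never more than max_tries times. A single DELAY is therefore
 * not treated as permanent, and a broken server cannot stall the
 * caller forever.
 */
enum trunk_probe_result
bounded_trunk_probe(trunk_probe_fn probe, void *ctx, unsigned int max_tries)
{
	unsigned int i;

	for (i = 0; i < max_tries; i++) {
		int status = probe(ctx);

		if (status == NFS4_OK)
			return TRUNK_PROBE_OK;
		if (status != NFS4ERR_DELAY)
			return TRUNK_PROBE_FAILED;
		/* A real client would sleep or back off here. */
	}
	return TRUNK_PROBE_GAVE_UP;
}

/*
 * Stand-in for a server, for demonstration: ctx points to an int
 * holding how many more NFS4ERR_DELAY replies to give before
 * answering NFS4_OK.
 */
int fake_server_probe(void *ctx)
{
	int *delays_left = ctx;

	if (*delays_left > 0) {
		(*delays_left)--;
		return NFS4ERR_DELAY;
	}
	return NFS4_OK;
}
```

With TRUNK_PROBE_GAVE_UP the caller could simply continue on the
transport it already has, whether the probe runs inline with the mount
or from a work queue.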