Re: [PATCH v1] NFSv4.1 provide mount option to toggle trunking discovery

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Thu, 24 Feb 2022 18:20:01 +0000

> On Feb 24, 2022, at 12:55 PM, Olga Kornievskaia <olga.kornievskaia@xxxxxxxxx> wrote:
> 
> On Thu, Feb 24, 2022 at 10:30 AM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>> 
>>> On Feb 23, 2022, at 12:40 PM, Olga Kornievskaia <olga.kornievskaia@xxxxxxxxx> wrote:
>>> 
>>> From: Olga Kornievskaia <kolga@xxxxxxxxxx>
>>> 
>>> Introduce a new mount option -- trunkdiscovery,notrunkdiscovery -- to
>>> toggle whether or not the client will engage in active discovery
>>> of trunking locations.
>> 
>> An alternative solution might be to change the client's
>> probe to treat NFS4ERR_DELAY as "no trunking information
>> available" and then allow operation to proceed on the
>> known good transport.
> 
> I'm not sure what you mean about "the known good transport".

The transport on which the client sent the
GETATTR(fs_locations).

The NFS4ERR_DELAY response means the server has no other
trunks available "at this time."
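
To make that concrete, here is a rough user-space sketch in Python,
not kernel code. The function name and its arguments are made up for
illustration; only the NFS4_OK and NFS4ERR_DELAY status values come
from the protocol. The point is simply that a DELAY reply leaves the
client with the transport it already has.

```python
NFS4_OK = 0
NFS4ERR_DELAY = 10008  # status value defined by the NFSv4 protocol


def transports_after_probe(status, known_good, advertised):
    """Pick the transports to use after a GETATTR(fs_locations) probe.

    Hypothetical helper: known_good is the transport the probe was
    sent on, advertised is whatever the server listed in its reply.
    """
    if status == NFS4_OK:
        # Keep the known good transport first, then add any new trunks.
        result = [known_good]
        for addr in advertised:
            if addr not in result:
                result.append(addr)
        return result
    # NFS4ERR_DELAY (or any other failure): no trunking information is
    # available at this time, so carry on with what already works.
    return [known_good]
```

With this shape, a slow or broken server costs the client nothing
beyond the one probe it already sent.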

> I don't
> think the ERR_DELAY is associated with a transport. Btw, a previous
> patch, which restricts the fs_locations query to the main transport,
> makes your statement even more confusing, as it would mean there is no
> good transport. Or do you mean to say we should have trunking
> discovery done asynchronous to mount by a separate kernel thread and
> therefore not impact mount steps?

Yes, something like that.

Trunking discovery that is independent of the NFS mount
process should be the goal. In fact, trunking discovery
really ought to be done in user space.

- There is now a user/kernel API for managing transports

- The trunking configuration on the server might change
  during the lifetime of the mount, so periodic checking
  is needed

- Adding an extra round trip, especially one that might
  be slowed by one or more NFS4ERR_DELAY replies, is
  going to be a problem during a mount storm

- There might be local policies that affect which network
  paths to choose for trunking

- The choice of transports might be made automatically
  by an orchestrator

- Tying this setting to a mount option is not appropriate
  because the transports are shared among multiple NFS
  mounts

> I do object to treating a single ERR_DELAY during discovery as a
> permanent error as there are legitimate reasons for a delay in looking
> up the information that can be resolved in time by the server.
> However, I don't object to putting a time limit or number of tries on
> ERR_DELAY as safety wheels.

In the past, some have objected to /any/ delay added to
the NFS mount process.

There's no reason to hold up the mount process -- the
client can try the trunking discovery probe again in a
few moments while the mount proceeds, can't it?

If that means handing the probe to a work queue or
leaving it to user space, that seems like a more
flexible choice.
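
Roughly what I have in mind, again as a Python sketch rather than
kernel code. A thread stands in for a work queue item here, and
send_probe, on_trunks, max_tries, and delay are all invented names
and values for the sake of the example. The mount returns right away;
the probe retries a bounded number of times on NFS4ERR_DELAY and then
gives up quietly.

```python
import threading
import time

NFS4_OK = 0
NFS4ERR_DELAY = 10008  # status value defined by the NFSv4 protocol


def probe_in_background(send_probe, on_trunks, max_tries=5, delay=0.01):
    """Run trunking discovery off the mount path (hypothetical API).

    send_probe() returns a (status, trunks) pair; on_trunks(trunks) is
    called once if the server eventually supplies trunking information.
    """
    def worker():
        for _ in range(max_tries):
            status, trunks = send_probe()
            if status == NFS4_OK:
                on_trunks(trunks)
                return
            if status != NFS4ERR_DELAY:
                return  # hard error: stop probing, keep existing transport
            time.sleep(delay)  # server asked us to wait; try again later
        # Retry budget exhausted: give up without disturbing the mount.

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def mount(send_probe, on_trunks):
    """Finish the mount without waiting for the discovery probe."""
    probe = probe_in_background(send_probe, on_trunks)
    return "mounted", probe
```

A periodic re-probe, or a user space agent driving the same logic,
would fit the same shape without touching the mount path.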

> Lastly, I think perhaps we can do both: have a mount option to toggle
> discovery as well as safeguard the discovery from broken servers?

I'd really rather not add a mount option for this
purpose unless you know of another reason why trunking
discovery needs to be disabled.

The best solution is to fix the server implementations.
If that's not possible then the second best is to have
the client manage the situation without needing any
human intervention.

Adding an administrative tunable is, to my mind, an
option of the very last resort.

--
Chuck Lever