Re: [PATCH v1] NFSv4.1 provide mount option to toggle trunking discovery

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Thu, 24 Feb 2022 21:53:49 +0000

> On Feb 24, 2022, at 4:25 PM, Olga Kornievskaia <olga.kornievskaia@xxxxxxxxx> wrote:
> 
> On Thu, Feb 24, 2022 at 1:20 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>> 
>> 
>>> On Feb 24, 2022, at 12:55 PM, Olga Kornievskaia <olga.kornievskaia@xxxxxxxxx> wrote:
>>> 
>>> On Thu, Feb 24, 2022 at 10:30 AM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>>>> 
>>>>> On Feb 23, 2022, at 12:40 PM, Olga Kornievskaia <olga.kornievskaia@xxxxxxxxx> wrote:
>>>>> 
>>>>> From: Olga Kornievskaia <kolga@xxxxxxxxxx>
>>>>> 
>>>>> Introduce a new mount option -- trunkdiscovery,notrunkdiscovery -- to
>>>>> toggle whether or not the client will actively engage in discovery
>>>>> of trunking locations.
>>>> 
>>>> An alternative solution might be to change the client's
>>>> probe to treat NFS4ERR_DELAY as "no trunking information
>>>> available" and then allow operation to proceed on the
>>>> known good transport.
>>> 
>>> I'm not sure what you mean about "the known good transport".
>> 
>> The transport on which the client sent the
>> GETATTR(fs_locations).
>> 
>> The NFS4ERR_DELAY response means the server has no other
>> trunks available "at this time."
> 
> But GETATTR(fs_locations) isn't only used for trunking query, it's
> used for filesystem location (migration) as well. Are we redefining
> what ERR_DELAY means in the context of trunking vs migration?

I don't think I'm redefining what is described in RFC 8881
Section 15.1.1.3. The meaning of that status code is still
the same; it's the client's recovery action that can be
made to be different.

During migration, NFS4ERR_DELAY holds off the client until
open and lock state has been transitioned to the destination
server. In that case DELAY has to serialize further operations
from the client, and waiting and retrying is the correct
response.

I mean, the client won't know the hostname of the destination
until the GETATTR(fs_locations) returns a successful result.

For trunking discovery, DELAY still means roughly -EAGAIN.
But it's up to the caller whether and when to try the
operation again. I'm suggesting that in the context of
trunking discovery, there's no need to halt progress
until trunking discovery succeeds. The discovery probe
can be dropped or retried in the background.
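
To make the distinction concrete, here is a rough user-space sketch of
the idea, not the actual client code. The enum and function names are
made up for illustration and do not exist in the kernel; only the
NFS4ERR_DELAY value (10008) comes from the RFC. The status code means
the same thing in both cases; only the recovery action differs.

```c
#include <assert.h>

/* Status values as defined by RFC 8881. */
enum { NFS4_OK = 0, NFS4ERR_DELAY = 10008 };

/* Hypothetical: why the client sent GETATTR(fs_locations). */
enum fsloc_purpose {
	FSLOC_MIGRATION,	/* following a file system that has moved */
	FSLOC_TRUNKING,		/* probing for additional transports */
};

/* Hypothetical recovery actions the caller can take. */
enum fsloc_action {
	FSLOC_USE_RESULT,	/* reply is valid, consume it */
	FSLOC_WAIT_AND_RETRY,	/* hold off further operations, then retry */
	FSLOC_DEFER_PROBE,	/* carry on using the transport the probe was
				 * sent on; retry in the background or drop */
	FSLOC_FAIL,		/* any other error: report it */
};

/* Pick a recovery action from the purpose of the query and its status. */
enum fsloc_action fsloc_recovery_action(enum fsloc_purpose purpose, int status)
{
	assert(purpose == FSLOC_MIGRATION || purpose == FSLOC_TRUNKING);

	if (status == NFS4_OK)
		return FSLOC_USE_RESULT;
	if (status != NFS4ERR_DELAY)
		return FSLOC_FAIL;

	/* DELAY: same meaning either way, different recovery. */
	if (purpose == FSLOC_MIGRATION)
		return FSLOC_WAIT_AND_RETRY;
	return FSLOC_DEFER_PROBE;
}
```

With something like this, a mount that gets DELAY on its trunking probe
would proceed immediately on the transport it already has, while
migration recovery keeps its existing wait-and-retry behavior.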

>>> I do object to treating a single ERR_DELAY during discovery as a
>>> permanent error as there are legitimate reasons for a delay in looking
>>> up the information that can be resolved in time by the server.
>>> However, I don't object to putting a time limit or number of tries on
>>> ERR_DELAY as safety wheels.
>> 
>> In the past, some have objected to /any/ delay added to
>> the NFS mount process.
> 
> I again would like to note that fs_locations is a file system
> attribute thus I would argue has to be treated as other file system
> attributes.

True, fs_locations, as it was originally defined, is a
per-filesystem attribute.

But I don't see how that is relevant to this issue. The
client doesn't have to wait for trunking information to
start its operation using the main transport.

>>> Lastly, I think perhaps we can do both: have a mount option to toggle
>>> discovery as well as safeguard the discovery from broken servers?
>> 
>> I'd really rather not add a mount option for this
>> purpose unless you know of another reason why trunking
>> discovery needs to be disabled.
> 
> I don't offhand. I thought it is the simplest and most appropriate
> solution and perhaps in line with "migration/nomigration" option but I
> must be mistaken there.

The "migration" option was a last resort. There were
really no other options to deal with servers that depend
on non-uniform client IDs.

There is an argument to be made that we shouldn't have
added that mount option because it controls the behavior
of all the mounts on that client.

IMO you shouldn't use "migration" as any kind of
precedent.

--
Chuck Lever