(resending this as it bounced off the list - I accidentally embedded HTML)
Yes, if you're pretty sure your hostnames are all different, the
client_ids should be different. For v4.0 you can turn on debugging
(rpcdebug -m nfs -s proc) and see the client_id in the kernel log in
lines that look like "NFS call setclientid auth=%s, '%s'\n", which will
happen at mount time, but it doesn't look like we have any debugging
for v4.1 and v4.2 for EXCHANGE_ID.
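For the v4.0 case, something like this on one client would show it (a
minimal sketch - the mount arguments are placeholders, and the grep
pattern assumes the log line quoted above):

    rpcdebug -m nfs -s proc                       # enable proc-level NFS debugging
    mount -t nfs -o vers=4.0 server:/export /mnt  # triggers SETCLIENTID
    dmesg | grep 'call setclientid'               # client_id shows up here
    rpcdebug -m nfs -c proc                       # clear the flag when done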
You can extract it via the crash utility, or via systemtap, or by doing
a wire capture, but nothing that's easily translated to running across
a large number of machines. There are probably other ways; perhaps we
should tack that string into the tracepoints for exchange_id and
setclientid.
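In the meantime those tracepoints will at least tell you when the calls
happen, even without the string. Assuming tracefs is mounted in the
usual place and your kernel names the events nfs4_setclientid and
nfs4_exchange_id, something like:

    cd /sys/kernel/debug/tracing
    echo 1 > events/nfs4/nfs4_setclientid/enable   # v4.0 clientid establishment
    echo 1 > events/nfs4/nfs4_exchange_id/enable   # v4.1/v4.2 equivalent
    cat trace_pipe                                 # then mount from another shell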
If you're interested in troubleshooting, wire capture's usually the most
informative. If the lockup events all happen at the same time, there
might be some network event that is triggering the issue.
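Something like this on a client, started before the mount (the
interface name and capture path are just placeholders), will catch the
SETCLIENTID/EXCHANGE_ID exchange for later inspection in wireshark:

    tcpdump -i eth0 -s 0 -w /tmp/nfs-mount.pcap port 2049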
You should expect NFSv4.1 to be rock-solid. It's rare that we get
reports that it isn't, and I'd love to know why you're having these
problems.
Ben
On 13 Apr 2021, at 11:38, hedrick@xxxxxxxxxxx wrote:
The server is Ubuntu 20, with a ZFS file system.
I don’t set the unique ID. Documentation claims that it is set from
the hostname. They will surely be unique, or the whole world would
blow up. How can I check the actual unique ID being used? The kernel
reports a blank one, but I think that just means to use the hostname.
We could obviously set a unique one if that would be useful.
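For reference, this is where I looked (assuming I have the module
parameter path right):

    cat /sys/module/nfs/parameters/nfs4_unique_id

and I assume we'd set a per-machine value with something like:

    echo "options nfs nfs4_unique_id=$(uuidgen)" > /etc/modprobe.d/nfs-unique-id.conf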
On Apr 13, 2021, at 11:35 AM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
It would be interesting to know why your clients are failing to
reclaim their locks. Something is misconfigured. What server are
you using, and is there anything fancy on the server-side (like HA)?
Is it possible that you have clients with the same nfs4_unique_id?
Ben
On 13 Apr 2021, at 11:17, hedrick@xxxxxxxxxxx wrote:
Many, though not all, of the problems are “lock reclaim failed”.
On Apr 13, 2021, at 10:52 AM, Patrick Goetz <pgoetz@xxxxxxxxxxxxxxx> wrote:
I use NFS 4.2 with Ubuntu 18/20 workstations and Ubuntu 18/20
servers and haven't had any problems.
Check your configuration files; the last time I experienced something
like this it was because I had inadvertently used the same fsid on two
different exports. I'd also recommend exporting top-level directories
only: bind mount everything you want to export into /srv/nfs and
export only those directories. According to Bruce F. this doesn't buy
you any security (I still don't understand why), but it makes for a
cleaner system configuration.
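For example (the paths, options, and fsid values here are only
illustrative):

    # /etc/fstab: bind the real locations under /srv/nfs
    /tank/home    /srv/nfs/home    none    bind    0 0
    /tank/data    /srv/nfs/data    none    bind    0 0

    # /etc/exports: export only the bind mounts, each with its own fsid
    /srv/nfs/home  *(rw,no_subtree_check,fsid=1)
    /srv/nfs/data  *(rw,no_subtree_check,fsid=2)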
On 4/13/21 9:33 AM, hedrick@xxxxxxxxxxx wrote:
I am in charge of a large computer science department's computing
infrastructure. We have a variety of student and development users. If
there are problems, we'll see them.
We use an Ubuntu 20 server, with NVMe storage.
I’ve just had to move CentOS 7 and Ubuntu 18 to use NFS 4.0. We had
hangs with NFS 4.1 and 4.2: files would appear to be locked, although
eventually the lock would time out. It’s too soon to be sure that
moving back to NFS 4.0 will fix it. Next is either NFS 3 or disabling
delegations on the server.
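Concretely, what I have in mind is either pinning the version in the
client fstab or turning off leases on the server, which I believe also
stops it handing out NFSv4 delegations (names and paths here are
illustrative):

    # client /etc/fstab: pin the protocol version
    server:/export  /mnt/export  nfs  vers=4.0  0 0

    # server: leases off => no delegations granted
    sysctl -w fs.leases-enable=0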
Are there known versions of NFS that are safe to use in production
for various kernel versions? The one we’re most interested in is
Ubuntu 20, which can be anything from 5.4 to 5.8.