Re: [RFC] server's statd and lockd will not sync after its nfslock restart


On Dec 17, 2009, at 3:27 PM, Trond Myklebust wrote:

On Thu, 2009-12-17 at 11:18 -0500, Chuck Lever wrote:
On Dec 17, 2009, at 5:07 AM, Mi Jinlong wrote:
Chuck Lever:
On Dec 16, 2009, at 5:27 AM, Mi Jinlong wrote:
Chuck Lever:
On Dec 15, 2009, at 5:02 AM, Mi Jinlong wrote:
Hi,

...snip...


The primary reason:

At step 3, when the client's reclaimed lock request is sent to the server,
the client's host (the host struct) is reused but is not re-monitored by
the server's lockd. After that, statd and lockd are out of sync.

The kernel squashes SM_MON upcalls for hosts that it already believes
are monitored.  This is a scalability feature.

When statd starts, it moves files from /var/lib/nfs/statd/sm/ to
/var/lib/nfs/statd/sm.bak/.

Well, it's really sm-notify that does this.  sm-notify is run by
rpc.statd when it starts up.

However, sm-notify should only retire the monitor list the first time
it is run after a reboot. Simply restarting statd should not change the
on-disk monitor list in the slightest.  If it does, there's some kind
of problem with the way sm-notify's pid file is managed, or perhaps
with the nfslock script.

When starting, statd calls the run_sm_notify() function to run
sm-notify.  Using the command "service nfslock restart" causes statd
to stop and start, so sm-notify will be run.  If sm-notify runs, the
on-disk monitor list will be changed.

If lockd doesn't send an SM_MON to statd, statd will not monitor the
clients that were monitored before the statd restart.

Question:

In my opinion, if lockd is allowed to reuse the client's host, it
should send an SM_MON to statd on reuse.  If reuse is not allowed,
the client's host should be destroyed immediately.

What should lockd do? Reuse? Destroy? Or some other action?

I don't immediately see why lockd should change its behavior.  Perhaps
statd/sm-notify were incorrect to delete the monitor list when you
restarted the nfslock service?

Sorry, maybe I did not express myself clearly.  I mean that lockd
reuses the host struct which was created before the statd restart.

It seems the monitor list was deleted when nfslock restarted.

lockd does not touch any user space files; the on-disk monitor list is
managed by statd and sm-notify.  A remote peer rebooting does not clear
the "monitored" flag for that peer in the local kernel's lockd, so it
won't send another SM_MON request.

Yes, that's right.

But this case refers to the server's lockd, not the remote peer.
I think that when the local system's nfslock restarts, the local
kernel's lockd should clear the "monitored" flag on all the other
clients' host structs.


Now, it may be the case that "service nfslock start" uses a command
line option that forces a fresh sm-notify run, and that is what is
wiping the on-disk monitor list.  That would be the bug in this
case -- sm-notify can and should be allowed to make its own
determination of whether the monitor list gets retired.  Notification
should not normally be forced by command line options in the nfslock
script.

A fresh sm-notify run is caused by statd starting.
I found it in the code, as follows.

utils/statd/statd.c
...
478         if (! (run_mode & MODE_NO_NOTIFY))
479                 switch (pid = fork()) {
480                 case 0:
481                         run_sm_notify(out_port);
482                         break;
483                 case -1:
484                         break;
485                 default:
486                         waitpid(pid, NULL, 0);
487                 }
....


I think that when statd restarts and calls sm-notify, the on-disk
monitor list will be deleted, so lockd should clear the "monitored"
flag on all the other clients' host structs.
After that, a reused host struct will be re-monitored and an on-disk
monitor record will be re-created.  That way, lockd and statd will be
in sync.

run_sm_notify() simply forks and execs the sm-notify program.  This
program checks for the existence of a pid file.  If the pid file
exists, then sm-notify exits.  If it does not, then sm-notify retires
the records in /var/lib/nfs/statd/sm and posts reboot notifications.
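For readers following the thread, the pid-file check described here can be
sketched roughly as follows.  This is a simplified illustration, not the
actual sm-notify source; the path and helper name are invented for the
example:

```c
/* Rough sketch of a pid-file reboot check: creating the file with
 * O_EXCL fails if it already exists, which is taken to mean that a
 * notification run has already happened.  Illustrative only -- not
 * the real sm-notify code; the path and function name are made up. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int already_notified(const char *pidfile)
{
	/* O_EXCL makes creation fail if the file is already present */
	int fd = open(pidfile, O_CREAT | O_EXCL | O_WRONLY, 0644);
	if (fd < 0)
		return 1;	/* pid file exists: skip notification */
	dprintf(fd, "%d\n", (int)getpid());
	close(fd);
	return 0;	/* first run: retire sm/ and post notifications */
}
```

The scheme only works if something removes the file at boot, which is
exactly the distribution difference Chuck describes next: a tmpfs
/var/run clears it automatically, a persistent one does not.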

Jeff Layton pointed out to me yesterday that Red Hat's nfslock script
unconditionally deletes sm-notify's pid file every time "service
nfslock start" is done, which effectively defeats sm-notify's reboot
detection.

sm-notify was written by a developer at SuSE. SuSE Linux uses a tmpfs
for /var/run, but Red Hat uses permanent storage for this directory.
Thus on SuSE, the pid file gets deleted automatically by a reboot, but
on Red Hat, the pid file must be deleted "by hand" or reboot
notification never occurs.

So the root cause of this problem is that the current mechanism
sm-notify uses to detect a reboot is not portable across distributions.

My new-statd prototype used a semaphore instead of a pid file to detect
reboots.  A semaphore is shared (visible to other processes) and will
continue to exist until it is deleted or the system reboots.  It is a
resource that is not destroyed automatically when the sm-notify
process exits.  If creating the semaphore fails, sm-notify exits.  If
creating it succeeds, it runs.
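A System V semaphore set has exactly the lifetime described: it survives
process exit and disappears on reboot or explicit removal.  A minimal
sketch of the idea (the key value and helper name here are invented for
illustration, and may differ from whatever the prototype actually does):

```c
/* Sketch of reboot detection with a System V semaphore: creation with
 * IPC_CREAT | IPC_EXCL succeeds exactly once per boot, because the
 * semaphore set persists until removed or until the system reboots.
 * The key value is an arbitrary example, not one any real statd uses. */
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>

static int first_run_since_boot(key_t key)
{
	if (semget(key, 1, IPC_CREAT | IPC_EXCL | 0600) >= 0)
		return 1;	/* created it: first run since boot */
	if (errno == EEXIST)
		return 0;	/* already ran: skip notification */
	return -1;	/* SysV IPC unavailable, permissions, etc. */
}
```

The -1 path matters for the "is support always built in" question below:
a kernel without SysV semaphore support would need a fallback.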

Would anyone strongly object to using a semaphore instead of a pid file
here?  Is support for semaphores always built into kernels?  Would
there be any problems with the small size of the semaphore name space?
Is there another similar facility that might be better?

One alternative might be to just record the kernel's random boot_id in
the pid file. That gets regenerated on each boot, so should be unique.

Where do you get it in user space? Is it available on earlier kernels? ("should be unique" -- I hope it doesn't have the same problem we had with XID replay on diskless systems).

Fwiw, I tried using the boot time stamp at one point, but unfortunately that's adjusted by the ntp offset, so it can take different values over time. It was difficult to compare it to a time stamp recorded in a file.
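On the user-space question: the kernel exposes boot_id as
/proc/sys/kernel/random/boot_id, a UUID regenerated on every boot.  A
minimal sketch of the comparison Trond suggests (the saved-copy path and
helper names are invented for illustration):

```c
/* Sketch of comparing the kernel's per-boot UUID against a copy saved
 * in a file, as suggested above.  The saved-copy path is illustrative. */
#include <stdio.h>
#include <string.h>

#define BOOT_ID_PATH "/proc/sys/kernel/random/boot_id"

static int read_line(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

/* Returns 1 if the system has rebooted since 'saved' was written (or
 * no saved copy exists), 0 if the boot_id is unchanged, -1 if the
 * kernel does not provide boot_id at all. */
static int rebooted_since(const char *saved)
{
	char now[64], old[64];
	if (read_line(BOOT_ID_PATH, now, sizeof(now)) < 0)
		return -1;	/* no boot_id: need a fallback check */
	if (read_line(saved, old, sizeof(old)) < 0)
		return 1;
	return strcmp(now, old) != 0;
}
```

Unlike the boot time stamp Chuck mentions, the UUID is not adjusted by
NTP, so a byte-for-byte comparison is sufficient.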

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
