Herta Van den Eynde wrote:
Herta Van den Eynde wrote:
Lon Hohberger wrote:
On Tue, 2005-10-11 at 17:48 +0200, Herta Van den Eynde wrote:
Bit of extra information: the system that was running the services
got STONITHed by the other cluster member shortly before midnight.
The services all failed over nicely, but the situation remains: if
I try to stop or relocate a service, I get a "device is busy".
I suppose that rules out an intermittent issue.
There are no mounts below mounts.
Drat.
Nfsd is the most likely candidate for holding the reference.
Unfortunately, this is not something I can track down; you will have to
file a support request and/or a Bugzilla. When you get a chance,
you should definitely try stopping nfsd and seeing if that clears the
mystery references (allowing you to unmount). If the problem comes from
nfsd, it should not be terribly difficult to track down.
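For what it's worth, the test during a maintenance window would look
roughly like this (assuming the stock RHEL init scripts; the mount point
is just an example). Note that fuser/lsof won't show kernel-held
references such as nfsd's, which is why stopping the daemon is the test:

  service nfs stop
  umount /mnt/export    # should succeed now if nfsd held the reference
  mount /mnt/export     # put it back
  service nfs start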
Also, you should not need to recompile your kernel to probe all the LUNs
per device; just edit /etc/modules.conf:
options scsi_mod max_scsi_luns=128
... then run mkinitrd to rebuild the initrd image.
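For example, something like this (kernel version and image name taken
from the running system; -f overwrites the existing initrd):

  echo 'options scsi_mod max_scsi_luns=128' >> /etc/modules.conf
  mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)

... and reboot afterwards so the new initrd takes effect.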
-- Lon
Next maintenance window is 4 weeks away, so I won't be able to test
the nfsd hypothesis anytime soon. In the meantime, I'll file a
support request. I'll keep you posted.
At least the unexpected STONITH confirms that the failover still works.
The /etc/modules.conf tip is a big time saver. Recompiling the kernel
and modules takes forever.
Thanks, Lon.
Herta
Apologies for not updating this sooner. (Thanks for reminding me, Owen.)
During a later maintenance window, I shut down the cluster services, but
it wasn't until I stopped nfsd that the filesystems could actually be
unmounted, which seems to confirm Lon's theory that nfsd was the likely
candidate for holding the reference.
I found a note elsewhere on the web where someone worked around the
problem by stopping nfsd, stopping the service, restarting nfsd, and
then relocating the service (see the sketch below). The disadvantage is
that all NFS services on the node experience a brief interruption at
that point.
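Spelled out, the workaround would be something like this (the service
name is a placeholder, and the exact clusvcadm flags may differ per
clumanager version - check the man page):

  service nfs stop        # nfsd drops its references
  clusvcadm -d nfs_svc    # the cluster service can now actually stop
  service nfs start       # NFS back up for the remaining exports
  clusvcadm -e nfs_svc    # re-enable (or relocate) the service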
Anyway, my problem disappeared during the latest maintenance window.
Both nfs-utils and clumanager were updated (nfs-utils-1.0.6-42EL ->
nfs-utils-1.0.6-43EL, clumanager-1.2.28-1 -> clumanager-1.2.31-1), so
I'm not 100% sure which of the two fixed it, and curious though I am, I
simply don't have the time to start reading the code. If anyone has
further insights, I'd love to read about it, though.
Kind regards,
Herta
Someone reported off-list that they are experiencing the same problem
while running the same versions we currently are.
So, just for completeness' sake: expecting problems, I also raised the
clumanager log levels during the last maintenance window. They are now at:
clumembd loglevel="6"
cluquorumd loglevel="6"
clurmtabd loglevel="7"
clusvcmgrd loglevel="6"
clulockd loglevel="6"
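(The levels follow the usual syslog priorities - 6 is info, 7 is debug.
If I remember correctly, in clumanager 1.2 these end up as attributes in
/etc/cluster.xml, along the lines of the fragment below; double-check
against your own file:

  <clumembd  loglevel="6" ... />
  <clurmtabd loglevel="7" ... />
)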
Come to think of it, I probably lowered the log levels during the
maintenance window when our problems began (I wanted to reduce the size
of the logs). Not sure how - or even if - this might affect things, though.
Kind regards,
Herta
--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster