Herta Van den Eynde wrote:
Herta Van den Eynde wrote:
Lon Hohberger wrote:
On Tue, 2005-10-11 at 17:48 +0200, Herta Van den Eynde wrote:
Bit of extra information: the system that was running the services
got STONITHed by the other cluster member shortly before midnight.
The services all failed over nicely, but the situation remains: if
I try to stop or relocate a service, I get a "device is busy".
I suppose that rules out an intermittent issue.
There are no mounts below mounts.
Drat.
Nfsd is the most likely candidate for holding the reference.
Unfortunately, this is not something I can track down; you will have to
file a support request and/or a Bugzilla. When you get a chance,
you should definitely try stopping nfsd and seeing if that clears the
mystery references (allowing you to unmount). If the problem comes from
nfsd, it should not be terribly difficult to track down.
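For what it's worth, the test during a maintenance window would look
roughly like this (assuming the stock RHEL init scripts; the mount point
is just an example). Note that fuser/lsof won't show kernel-held
references such as nfsd's, which is why stopping the daemon is the test:

  service nfs stop
  umount /mnt/export    # should succeed now if nfsd held the reference
  mount /mnt/export     # put it back
  service nfs start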
Also, you should not need to recompile your kernel to probe all the LUNs
per device; just edit /etc/modules.conf:
options scsi_mod max_scsi_luns=128
... then run mkinitrd to rebuild the initrd image.
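For example, something like this (kernel version and image name taken
from the running system; -f overwrites the existing initrd):

  echo 'options scsi_mod max_scsi_luns=128' >> /etc/modules.conf
  mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)

... and reboot afterwards so the new initrd takes effect.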
-- Lon
Next maintenance window is 4 weeks away, so I won't be able to test
the nfsd hypothesis anytime soon. In the meantime, I'll file a
support request. I'll keep you posted.
At least the unexpected STONITH confirms that the failover still works.
The /etc/modules.conf tip is a big time saver. Recompiling the kernel
and modules takes forever.
Thanks, Lon.
Herta
Apologies for not updating this sooner. (Thanks for reminding me, Owen.)
During a later maintenance window, I shut down the cluster services, but
it wasn't until I stopped nfsd that the filesystems could actually be
unmounted, which seems to confirm Lon's theory that nfsd was the likely
candidate for holding the reference.
I found a note elsewhere on the web where someone worked around the
problem by stopping nfsd, stopping the service, restarting nfsd, and
then relocating the service (see the sketch below). The disadvantage is
that all NFS services on the node experience a brief interruption at
that point.
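Spelled out, the workaround would be something like this (the service
name is a placeholder, and the exact clusvcadm flags may differ per
clumanager version - check the man page):

  service nfs stop        # nfsd drops its references
  clusvcadm -d nfs_svc    # the cluster service can now actually stop
  service nfs start       # NFS back up for the remaining exports
  clusvcadm -e nfs_svc    # re-enable (or relocate) the service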
Anyway, my problem disappeared during the latest maintenance window.
Both nfs-utils and clumanager were updated (nfs-utils-1.0.6-42EL ->
nfs-utils-1.0.6-43EL, clumanager-1.2.28-1 -> clumanager-1.2.31-1), so
I'm not 100% sure which of the two fixed it, and curious though I am, I
simply don't have the time to start reading the code. If anyone has
further insights, I'd love to read about it, though.
Kind regards,
Herta
Someone reported off-list that they are experiencing the same problem
while running the same versions we currently are.
So, just for completeness' sake: expecting problems, I also raised the
clumanager log levels during the last maintenance window. They are now at:
clumembd loglevel="6"
cluquorumd loglevel="6"
clurmtabd loglevel="7"
clusvcmgrd loglevel="6"
clulockd loglevel="6"
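(The levels follow the usual syslog priorities - 6 is info, 7 is debug.
If I remember correctly, in clumanager 1.2 these end up as attributes in
/etc/cluster.xml, along the lines of the fragment below; double-check
against your own file:

  <clumembd  loglevel="6" ... />
  <clurmtabd loglevel="7" ... />
)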
Come to think of it, I probably lowered the log levels during the
maintenance window when our problems began (I wanted to reduce the size
of the logs). Not sure how - or even if - this might affect things, though.
Kind regards,
Herta
--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster