Re: [Linux-cluster] GFS 2 node hang in rm test

Daniel McNeil <daniel@xxxxxxxx> · Tue, 07 Dec 2004 08:53:14 -0800

On Tue, 2004-12-07 at 01:38, Patrick Caulfield wrote:
> On Mon, Dec 06, 2004 at 04:13:50PM -0800, Daniel McNeil wrote:
> > On Mon, 2004-12-06 at 11:45, Ken Preslan wrote:
> > > On Fri, Dec 03, 2004 at 03:08:00PM -0800, Daniel McNeil wrote:
> > 
> > 
> > Looking at the stack trace above and disassembling dlm.ko,
> > it looks like dlm_lock+0x319 is the call to dlm_lock_stage1().
> > Looking at dlm_lock_stage1(), it looks like it is sleeping on
> > 	 down_write(&rsb->res_lock)
> > 
> > So now I have to find who is holding the res_lock.
> 
> That's consistent with the hang you reported before - in fact it's almost
> certainly the same thing. My guess is that there is a deadlock on res_lock
> somewhere. In which case I suspect it's going to be easier to find that one by
> reading code rather than running tests. res_lock should never be held for any
> extended period of time, but in your last set of tracebacks there was nothing
> obviously holding it - so I suspect something is sleeping with it.
> 
> 

I looked through the stack traces and did not see any other
processes that might be holding the lock.  There were only
3 other processes with stack traces in the dlm module and
they do not look like they are holding it.  That is
confusing.  I can think of 3 possibilities:

	1. some code path forgets to up the semaphore
	2. a process spinning in the kernel is holding it
	3. the structure containing the res_lock was freed

All of these seem unlikely to me.  I reviewed the code
last evening; the up's and down's are close together
and nothing looked obviously wrong.

I'll think about adding more debug output.

I ran it again last night and it ran 27 loops until 7am this
morning before hanging.  I'm still collecting info from this
hang.  At least it is reproducible.

Daniel