On Tue, 2004-12-07 at 01:38, Patrick Caulfield wrote:
> On Mon, Dec 06, 2004 at 04:13:50PM -0800, Daniel McNeil wrote:
> > On Mon, 2004-12-06 at 11:45, Ken Preslan wrote:
> > > On Fri, Dec 03, 2004 at 03:08:00PM -0800, Daniel McNeil wrote:
> > >
> > Looking at the stack trace above and disassembling dlm.ko,
> > it looks like dlm_lock+0x319 is the call to dlm_lock_stage1().
> > Looking at dlm_lock_stage1(), it looks like it is sleeping on
> > down_write(&rsb->res_lock).
> >
> > So now I have to find who is holding the res_lock.
>
> That's consistent with the hang you reported before - in fact it's almost
> certainly the same thing. My guess is that there is a deadlock on res_lock
> somewhere. In which case I suspect it's going to be easier to find that one by
> reading code rather than running tests. res_lock should never be held for any
> extended period of time, but in your last set of tracebacks there was nothing
> obviously holding it - so I suspect something is sleeping with it.
>

I looked through the stack traces and did not see any other processes
that might be holding the lock. There were only 3 other processes with
stack traces in the dlm module, and they do not look like they are
holding it.

That is confusing. I can think of 3 possibilities:

1. forgetting to up the semaphore somewhere
2. a process spinning in the kernel is holding it
3. the structure containing the res_lock was freed

All of these seem unlikely to me. I reviewed the code last evening;
the up's and down's are close together and nothing looked obviously
wrong.

I'll think about adding more debug output.

I ran it again last night and it ran 27 loops, until 7am this morning,
before hanging. I'm still collecting info from this hang.
At least it is reproducible.

Daniel
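
For illustration only, here is a minimal userspace sketch of one form the extra debug output could take: a wrapper that records which function and line last took the lock, so a hung waiter can report the holder. None of this is dlm code. The names dbg_lock, dbg_down, dbg_up and dbg_describe are hypothetical, and a plain pthread mutex stands in for the kernel rw_semaphore behind res_lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical debug wrapper, not part of dlm: remembers the current holder. */
struct dbg_lock {
    pthread_mutex_t mutex;   /* userspace stand-in for the rw_semaphore */
    const char *owner_func;  /* function currently holding it, or NULL */
    int owner_line;          /* source line where it was acquired */
};

void dbg_lock_init(struct dbg_lock *l)
{
    pthread_mutex_init(&l->mutex, NULL);
    l->owner_func = NULL;
    l->owner_line = 0;
}

/* Acquire the lock, then record who took it. */
void dbg_down(struct dbg_lock *l, const char *func, int line)
{
    pthread_mutex_lock(&l->mutex);
    l->owner_func = func;
    l->owner_line = line;
}

/* Clear the owner record, then release. The assert catches an up
 * that has no matching down. */
void dbg_up(struct dbg_lock *l)
{
    assert(l->owner_func != NULL);
    l->owner_func = NULL;
    l->owner_line = 0;
    pthread_mutex_unlock(&l->mutex);
}

/* Format the current holder into buf. This reads the owner fields without
 * taking the lock, which is racy but good enough for a debug dump. */
void dbg_describe(const struct dbg_lock *l, char *buf, size_t len)
{
    const char *func = l->owner_func;

    if (func)
        snprintf(buf, len, "held by %s:%d", func, l->owner_line);
    else
        snprintf(buf, len, "not held");
}

/* Convenience macro so call sites record their own location. */
#define DBG_DOWN(l) dbg_down((l), __func__, __LINE__)
```

A record like this would distinguish the first two possibilities (a missing up, or a holder that never returns) because the owner stays set in both cases and names the call site. It would not help if the structure itself had been freed.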