Re: lost interrupt after a signal?

Chuck Lever <chuck.lever@xxxxxxxxxx> · Tue, 9 Dec 2008 17:52:10 -0500

On May 27, 2008, at 1:35 PM, Matthew Wilcox wrote:
On Tue, May 27, 2008 at 11:59:00AM -0400, Chuck Lever wrote:
This isn't jumping out screaming that it's my fault (obviously it
probably is, but ...).  invalidate_inode_pages2_range calls lock_page()
... which uses TASK_UNINTERRUPTIBLE.  If it were calling
lock_page_killable(), I'd understand.
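
The distinction drawn above, between a sleep in TASK_UNINTERRUPTIBLE
and the killable sleep used by lock_page_killable(), can be modelled in
userspace C roughly as follows.  This is a simplified sketch, not kernel
code: the constant values are illustrative (the real ones live in the
kernel's sched.h and vary between versions), and signal_ends_sleep() is
a hypothetical helper that only mirrors the general shape of the
kernel's check for whether a pending signal should end a sleep.

```c
#include <stdbool.h>

/* Illustrative bit values only; assumed for this sketch. */
#define TASK_INTERRUPTIBLE   0x01u
#define TASK_UNINTERRUPTIBLE 0x02u
#define TASK_WAKEKILL        0x80u
/* A killable sleep is an uninterruptible sleep that fatal signals may end. */
#define TASK_KILLABLE        (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)

/*
 * Hypothetical helper: returns true if a signal would end a sleep in the
 * given task state.  'pending' says whether any signal is pending;
 * 'fatal' says whether that pending signal is fatal (e.g. SIGKILL).
 */
static bool signal_ends_sleep(unsigned int state, bool pending, bool fatal)
{
    /* Plain uninterruptible sleeps never look at signals at all. */
    if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
        return false;
    if (!pending)
        return false;
    /* Interruptible: any signal wakes.  Killable: only a fatal one. */
    return (state & TASK_INTERRUPTIBLE) || fatal;
}
```

Under this model, a task sleeping in TASK_UNINTERRUPTIBLE is not woken
even by a fatal signal, while one in TASK_KILLABLE is woken by a fatal
signal but not by an ordinary one.  That is why a hang inside a plain
lock_page() would not be explained by a signal interrupting the wait.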

I don't think it's directly caused by your changes, but my concern is
that you may have exposed a latent bug, or exposed an underlying
design assumption in the NFS/RPC client stack that causes the hang in
this situation.

Certainly possible.

Maybe this isn't the problem task though.  Maybe this is just the
canary that dropped dead, and we should stop trying to autopsy it and
start running.  [ok, I'll stop with the bad analogies now]

This appears to be the only task that is in this state.  All the
others in the dump are waiting for this inode's mutex.  I don't know
if the dump is complete, though.

My thought is that the task which caused the problem has gone away and
left this page in a state where sync_page will never finish.

One thing to note: NFS doesn't have a sync_page() a_op.  So this
shouldn't be the problem, right?

I've passed your suggestions along to our testers.

Thanks!  I'm keen to get this fixed.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com