Re: Optimal NFS mount options to safely allow interrupts and timeouts on newer kernels

----- Original Message -----
> From: "Trond Myklebust" <trond.myklebust@xxxxxxxxxxxxxxx>
> To: "Andrew Martin" <amartin@xxxxxxxxxxx>
> Cc: "Jim Rees" <rees@xxxxxxxxx>, bhawley@xxxxxxxxxxx, "Brown Neil" <neilb@xxxxxxx>, linux-nfs-owner@xxxxxxxxxxxxxxx,
> linux-nfs@xxxxxxxxxxxxxxx
> Sent: Thursday, March 6, 2014 3:01:03 PM
> Subject: Re: Optimal NFS mount options to safely allow interrupts and timeouts on newer kernels
> 
> 
> On Mar 6, 2014, at 15:45, Andrew Martin <amartin@xxxxxxxxxxx> wrote:
> 
> > ----- Original Message -----
> >> From: "Trond Myklebust" <trond.myklebust@xxxxxxxxxxxxxxx>
> >>> I attempted to get a backtrace from one of the uninterruptable apache
> >>> processes:
> >>> echo w > /proc/sysrq-trigger
> >>> 
> >>> Here's one example:
> >>> [1227348.003904] apache2       D 0000000000000000     0 10175   1773
> >>> 0x00000004
> >>> [1227348.003906]  ffff8802813178c8 0000000000000082 0000000000015e00
> >>> 0000000000015e00
> >>> [1227348.003908]  ffff8801d88f03d0 ffff880281317fd8 0000000000015e00
> >>> ffff8801d88f0000
> >>> [1227348.003910]  0000000000015e00 ffff880281317fd8 0000000000015e00
> >>> ffff8801d88f03d0
> >>> [1227348.003912] Call Trace:
> >>> [1227348.003918]  [<ffffffffa00a5ca0>] ? rpc_wait_bit_killable+0x0/0x40
> >>> [sunrpc]
> >>> [1227348.003923]  [<ffffffffa00a5cc4>] rpc_wait_bit_killable+0x24/0x40
> >>> [sunrpc]
> >>> [1227348.003925]  [<ffffffff8156a41f>] __wait_on_bit+0x5f/0x90
> >>> [1227348.003930]  [<ffffffffa00a5ca0>] ? rpc_wait_bit_killable+0x0/0x40
> >>> [sunrpc]
> >>> [1227348.003932]  [<ffffffff8156a4c8>] out_of_line_wait_on_bit+0x78/0x90
> >>> [1227348.003934]  [<ffffffff81086790>] ? wake_bit_function+0x0/0x40
> >>> [1227348.003939]  [<ffffffffa00a6611>] __rpc_execute+0x191/0x2a0 [sunrpc]
> >>> [1227348.003945]  [<ffffffffa00a6746>] rpc_execute+0x26/0x30 [sunrpc]
> >> 
> >> That basically means that the process is hanging in the RPC layer,
> >> somewhere
> >> in the state machine. ‘echo 0 >/proc/sys/sunrpc/rpc_debug’ as the ‘root’
> >> user should give us a dump of which state these RPC calls are in. Can you
> >> please try that?
> > Yes I will definitely run that the next time it happens, but since it
> > occurs
> > sporadically (and I have not yet found a way to reproduce it on demand), it
> > could be days before it occurs again. I'll also run "netstat -tn" to check
> > the
> > TCP connections the next time this happens.
> 
> If you are comfortable applying patches and compiling your own kernels, then
> you might want to try applying the fix for a certain out-of-socket-buffer
> race that Neil reported, and that I suspect you may be hitting. The patch
> has been sent to the ‘stable kernel’ series, and so should appear soon in
> Debian’s own kernels, but if this is bothering you now, then go for it…
> 
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=06ea0bfe6e6043cb56a78935a19f6f8ebc636226
> 

Trond,

This problem has recurred, and I have captured the debug output that you requested:

echo 0 >/proc/sys/sunrpc/rpc_debug:
http://pastebin.com/9juDs2TW

echo w > /proc/sysrq-trigger ; dmesg:
http://pastebin.com/1vDx9bNf

netstat -tn:
http://pastebin.com/mjxqjmuL
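
In case it helps anyone collect the same data the next time a hang occurs,
the three captures above can be wrapped in one small script. This is only a
sketch: the function names (list_dstate, collect_nfs_debug) and the output
file names are made up for illustration, it assumes the rpc_debug dump ends
up in the kernel log so that dmesg picks it up, and it needs root.

```shell
#!/bin/sh
# list_dstate: read `ps` output on stdin, skip the header line, and keep
# only tasks whose state column starts with D (uninterruptible sleep).
list_dstate() {
    awk 'NR > 1 && $2 ~ /^D/ { print }'
}

# collect_nfs_debug OUTDIR: run the three captures from this mail and save
# the results under OUTDIR (hypothetical layout). Must be run as root.
collect_nfs_debug() {
    outdir=$1
    mkdir -p "$outdir" || return 1
    echo 0 > /proc/sys/sunrpc/rpc_debug   # dump pending RPC tasks
    echo w > /proc/sysrq-trigger          # dump blocked (D state) tasks
    dmesg > "$outdir/dmesg.txt"           # assumed to contain both dumps
    netstat -tn > "$outdir/netstat.txt"   # TCP connection state
    ps -eo pid,stat,wchan:32,comm | list_dstate > "$outdir/dstate.txt"
}
```

Usage would be something like "collect_nfs_debug /tmp/nfs-hang" as root
while the processes are still hung.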

One debugging suggestion was to run "umount -f /path/to/mountpoint"
repeatedly to try to send SIGKILL back up to the application. This always
returned "Device or resource busy", and I was unable to unmount the
filesystem until I used "umount -l".
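
For reference, the retry sequence described above can be sketched as a small
shell function. This is only an illustration: the function name
force_then_lazy_umount and the default of five attempts are made up here.
As far as I understand, a lazy unmount only detaches the mountpoint and does
not by itself unblock processes that are already stuck in the RPC layer.

```shell
#!/bin/sh
# force_then_lazy_umount MOUNTPOINT [ATTEMPTS]
# Try a forced unmount up to ATTEMPTS times (default 5, an arbitrary
# choice), then fall back to a lazy unmount if every attempt fails.
force_then_lazy_umount() {
    mnt_path=$1
    mnt_attempts=${2:-5}
    mnt_try=1
    while [ "$mnt_try" -le "$mnt_attempts" ]; do
        if umount -f "$mnt_path"; then
            echo "unmounted after $mnt_try forced attempt(s)"
            return 0
        fi
        mnt_try=$((mnt_try + 1))
        sleep 1   # short pause before the next forced attempt
    done
    # Every forced attempt failed: detach the mountpoint lazily instead.
    umount -l "$mnt_path" && echo "lazy unmount requested"
}
```

For example: "force_then_lazy_umount /path/to/mountpoint 3" as root.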

I was able to kill -9 all but two of the processes that were blocked in
uninterruptible sleep. Note that I was able to get lsof output for these
processes this time, and they all appeared to be blocked on access to a
single file on the NFS share. If I tried to cat that file from this client,
my terminal would block, with strace showing the hang in read():
open("/path/to/file", O_RDONLY)        = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=42385, ...}) = 0
mmap(NULL, 1056768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb00f0dc000
read(3,

However, I could cat the file just fine from another NFS client. Does this
additional information shed any light on the source of this problem?
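
To compare clients without tying up a terminal, the read test above can be
bounded with a timeout. This is a rough sketch, not a definitive check: the
function name probe_read and the 10 second default are arbitrary, and it
assumes the blocked read is in a killable wait (as rpc_wait_bit_killable in
the earlier trace suggests), so that SIGKILL from timeout(1) can end it.

```shell
#!/bin/sh
# probe_read FILE [SECONDS]
# Try to read FILE in full, killing the reader with SIGKILL if it has not
# finished within SECONDS (default 10, an arbitrary choice).
probe_read() {
    probe_file=$1
    probe_secs=${2:-10}
    if timeout -s KILL "$probe_secs" cat -- "$probe_file" > /dev/null 2>&1; then
        echo "read ok: $probe_file"
        return 0
    fi
    # Non-zero status: either cat failed outright or it was killed on timeout.
    echo "read failed or timed out after ${probe_secs}s: $probe_file"
    return 1
}
```

Running "probe_read /path/to/file" on both the affected client and a healthy
one should show whether the hang is specific to one client.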

Thanks,

Andrew








