Re: How to deal with such hanging processes?

Jeff Layton <jlayton@xxxxxxxxxx> · Sat, 28 Jan 2012 07:30:21 -0500

On Fri, 27 Jan 2012 21:33:35 +0100
Łukasz Maśko <masko@xxxxxxxxxxxxx> wrote:

> I have a Welland ME-752GNS NAS. It can serve files only over FTP or CIFS.
> To transfer data quickly I use FTP, but if I want to mount the disk, for
> instance to browse my images or watch movies, I'm forced to use CIFS.
> 
> It seems to work, but not too well. I realise that my problems come mainly
> from the poor CIFS implementation in the NAS firmware, but since it is the
> only NAS I have now and I cannot afford to replace it, I must somehow live
> with it. The main problem is that the data transfer quite often goes wrong.
> First, it produces entries like these in dmesg and the logs:
> 
> [ 5743.489573] CIFS VFS: ignoring corrupt resume name
> [ 5743.553028] CIFS VFS: ignoring corrupt resume name
> [ 5743.652823] CIFS VFS: ignoring corrupt resume name
> [ 5744.822936] CIFS VFS: ignoring corrupt resume name
> [ 5758.608685] CIFS VFS: ignoring corrupt resume name
> [ 5770.010003] CIFS VFS: ignoring corrupt resume name
> [ 5792.937939] CIFS VFS: Send error in read = -512
> [ 5792.938948] CIFS VFS: No task to wake, unknown frame received! NumMids 2
> [ 5792.938958] Received Data is: : dump of 37 bytes of data at 0xf4f4b6c0
> [ 5792.938974]  60000000 424d53ff 0000a4a4 c0018000 . . . ` \xffffffff S M B ¤ ¤ . . . . . Ŕ
> [ 5792.938988]  00000000 00000000 00000000 2e130006 . . . . . . . . . . . . . . . .
> [ 5792.938996]  67950002 00000012 . . . g .
> 
> The "CIFS VFS: ignoring corrupt resume name" message in particular appears
> very often, but it does not cause any major problems.
> Then, but not always, the process performing the data transfer hangs and
> I get the following errors:
> 
> [ 6120.569517] INFO: task kio_file:12029 blocked for more than 120 seconds.
> [ 6120.569521] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 6120.569525] kio_file        D e417bc30     0 12029   6037 0x00000004
> [ 6120.569533]  e417bcb4 00000086 e417bc30 e417bc30 e417bc38 31a0b404 00000559 00000000
> [ 6120.569543]  c0724a80 e417bc58 c0724a80 f6707a80 f60ab180 f0c539c0 00000020 00000000
> [ 6120.569552]  e427e3c0 e5602938 00000020 e560293c 000003b7 00000010 c0664940 e417bcb4
> [ 6120.569561] Call Trace:
> [ 6120.569575]  [<c0174f6c>] ? ktime_get_ts+0xdc/0x110
> [ 6120.569583]  [<c04eadc0>] schedule+0x30/0x50
> [ 6120.569588]  [<c04eae53>] io_schedule+0x73/0xb0
> [ 6120.569594]  [<c01d93c8>] sleep_on_page+0x8/0x10
> [ 6120.569599]  [<c04eb4d7>] __wait_on_bit_lock+0x47/0x90
> [ 6120.569604]  [<c01d93c0>] ? __lock_page+0x80/0x80
> [ 6120.569609]  [<c01d93b6>] __lock_page+0x76/0x80
> [ 6120.569616]  [<c016c2e0>] ? autoremove_wake_function+0x40/0x40
> [ 6120.569623]  [<c024ad6d>] __generic_file_splice_read+0x52d/0x550
> [ 6120.569630]  [<c03f535c>] ? sock_alloc_send_pskb+0x15c/0x290
> [ 6120.569636]  [<c03f94bb>] ? __alloc_skb+0x5b/0x210
> [ 6120.569640]  [<c03f535c>] ? sock_alloc_send_pskb+0x15c/0x290
> [ 6120.569647]  [<c03018fd>] ? _copy_from_user+0x3d/0x60
> [ 6120.569652]  [<c03f8f27>] ? skb_queue_tail+0x37/0x50
> [ 6120.569659]  [<c0484150>] ? unix_stream_sendmsg+0x3d0/0x420
> [ 6120.569665]  [<c0249600>] ? page_cache_pipe_buf_release+0x20/0x20
> [ 6120.569671]  [<c024ae24>] generic_file_splice_read+0x94/0x100
> [ 6120.569677]  [<c024ad90>] ? __generic_file_splice_read+0x550/0x550
> [ 6120.569682]  [<c02498f0>] do_splice_to+0x60/0x80
> [ 6120.569687]  [<c0249b2e>] splice_direct_to_actor+0xae/0x1d0
> [ 6120.569692]  [<c0249860>] ? do_splice_from+0x80/0x80
> [ 6120.569698]  [<c024afcd>] do_splice_direct+0x4d/0x70
> [ 6120.569705]  [<c02252e1>] do_sendfile+0x181/0x220
> [ 6120.569710]  [<c0226053>] sys_sendfile64+0x53/0xc0
> [ 6120.569716]  [<c04f391f>] sysenter_do_call+0x12/0x28
> 

The process here is stuck waiting for the page lock on a page. Quite
possibly that page is part of a file on a cifs filesystem.

> I'm unable to kill this process and it prevents the share from being 
> unmounted:
> 
> $ ps ax | grep kio_file
> 12029 ?        D      0:00 kdeinit4: kio_file [kdeinit] file local:/home/users/ed/tmp/ksocket-ed/klauncherTi6038.slave-socket local:/home/users/ed/tmp/ksocket-ed/dolphinU11997.slave-socket
> 

Right. D state is uninterruptible sleep, and you won't be able to kill
it until it wakes up and comes out of kernel space.
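
If it helps to see which tasks are currently in that state, here is a
minimal sketch that scans /proc for processes whose state field is "D".
It assumes a Linux /proc layout; the function names are made up for
illustration and are not part of any existing tool.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the first '(' and the last ')'.
    """
    lpar = stat_text.index("(")
    rpar = stat_text.rindex(")")
    pid = int(stat_text[:lpar])
    comm = stat_text[lpar + 1:rpar]
    state = stat_text[rpar + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in D state."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc; nothing to report
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited meanwhile, or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

This only reports the state; it cannot wake or kill such a task.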

> So far I've learned that the following combination works: first, I can
> umount the share with the -l (lazy) option, although the process in question
> still exists. Second, I can turn the NAS off, wait a moment and turn it on
> again (I'm not 100% sure the NAS restart is required, but it works), and
> then reload the cifs.ko module. As a result, the process is gone and I can
> keep on working. Until the problem occurs again...
> 
> I'm using PLD Linux (which is probably not important). I have a vanilla
> kernel, right now 3.2.2, but the same thing has happened since 2.6.x (the
> only improvement after changing to 3.2 is a big performance jump). I have
> cifs-utils-5.2 installed and I'm loading the cifs.ko module with the
> following parameters:
> 
> echo_retries=1 cifs_max_pending=2
> 
> cifs_max_pending=2 is the most important: the higher the value, the more
> often the problem occurs, and 2 is the smallest possible value.
> 
> Is there anything I can do on the side of my Linux box in this situation? I
> cannot upgrade the NAS firmware, because I have the latest version and
> probably no newer one will be released (it is closed-source). I cannot get
> rid of this NAS either, at least for some time. The best outcome would of
> course be to make cifs work with my NAS anyway, but that is up to you, as I
> don't have enough knowledge about it.

The way to deal with them is to solve the problem that causes them to
hang in the first place. Once they're stuck like that, there's really
little you can do until the page lock is released. The messages from
the ring buffer suggest that the server is sending corrupt replies to
the requests. A network capture might help confirm that.
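
As a rough way to look at the frame that was dumped to the ring buffer,
the sketch below turns the quoted hex words back into bytes and picks a
few fields out of the SMB1 header. It assumes the dump prints host-order
32-bit words on a little-endian machine (the "S M B" in the ASCII column
suggests that), and the helper names are purely illustrative. It is no
substitute for a real capture.

```python
import struct

# Hex words copied from the quoted "Received Data" dump (37 bytes reported).
DUMP_WORDS = """
60000000 424d53ff 0000a4a4 c0018000
00000000 00000000 00000000 2e130006
67950002 00000012
"""
DUMP_LENGTH = 37


def words_to_bytes(words_text, length):
    """Rebuild the raw buffer, assuming little-endian 32-bit words."""
    words = [int(w, 16) for w in words_text.split()]
    raw = struct.pack("<%dI" % len(words), *words)
    return raw[:length]


def parse_smb1_header(frame):
    """Pick out a few SMB1 header fields after the 4-byte length prefix."""
    if frame[4:8] != b"\xffSMB":
        raise ValueError("no SMB1 magic at offset 4")
    # command (1 byte), status (4), flags (1), flags2 (2), all little-endian
    command, status, flags, flags2 = struct.unpack_from("<BIBH", frame, 8)
    # tid, pid, uid, mid sit at the end of the 32-byte header
    tid, pid, uid, mid = struct.unpack_from("<HHHH", frame, 28)
    return {
        "payload_length": int.from_bytes(frame[1:4], "big"),
        "command": command,
        "status": status,
        "is_response": bool(flags & 0x80),  # 0x80 is the SMB1 response flag
        "flags2": flags2,
        "tid": tid,
        "pid": pid,
        "uid": uid,
        "mid": mid,
    }


if __name__ == "__main__":
    frame = words_to_bytes(DUMP_WORDS, DUMP_LENGTH)
    for name, value in parse_smb1_header(frame).items():
        print(name, hex(value) if isinstance(value, int) else value)
```

The mid value is the interesting one here: "No task to wake" means the
client found no outstanding request with a matching mid.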

Is this the same NAS that requests a maxmpx of 1? If so, the fact that
cifs sends more than one request at a time to this server might be the
ultimate cause.

Obviously the server should handle that situation without corrupting
its replies, but cifs is clearly broken in this regard and shouldn't be
sending more than one request at a time to such a server. I doubt
there's anything you can do until Steve fixes that bug.
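
To illustrate what honoring the server's limit would mean, here is a
small userspace sketch: cap the number of requests in flight at the
smaller of the client's own setting and the maxmpx the server asked for.
This is not the kernel code, and RequestLimiter and its parameters are
hypothetical names used only for the example.

```python
import threading


class RequestLimiter:
    """Cap in-flight requests at min(client setting, server maxmpx)."""

    def __init__(self, client_max_pending, server_maxmpx):
        # Never go below one slot, otherwise nothing could ever be sent.
        self.limit = max(1, min(client_max_pending, server_maxmpx))
        self._slots = threading.BoundedSemaphore(self.limit)

    def try_acquire(self):
        """Take a slot without blocking; False means the limit is reached."""
        return self._slots.acquire(blocking=False)

    def release(self):
        """Give the slot back once the reply has been received."""
        self._slots.release()

    def __enter__(self):
        self._slots.acquire()  # block until a slot is free
        return self

    def __exit__(self, exc_type, exc, tb):
        self._slots.release()
        return False


if __name__ == "__main__":
    limiter = RequestLimiter(client_max_pending=2, server_maxmpx=1)
    print("effective limit:", limiter.limit)
    print("first request allowed:", limiter.try_acquire())
    print("second request allowed:", limiter.try_acquire())
```

With a server maxmpx of 1, the second request has to wait for the first
reply even though the client-side setting is 2.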

-- 
Jeff Layton <jlayton@xxxxxxxxxx>