How to deal with such hanging processes?

Łukasz Maśko <masko@xxxxxxxxxxxxx> · Fri, 27 Jan 2012 21:33:35 +0100

I have a Welland ME-752GNS NAS. It can serve files only over FTP or CIFS. 
For quick data transfers I use FTP, but if I want to mount the disk, for 
instance to browse my images or watch movies, I'm forced to use CIFS.

It seems to work, but not too well. I realise that my problems come mainly 
from the poor CIFS implementation in the NAS firmware, but since it is the 
only NAS I have now and I cannot afford to replace it, I must somehow live 
with it. The main problem is that quite often something goes wrong with the 
data transfer. First, it results in entries like these in dmesg and the 
logs:

[ 5743.489573] CIFS VFS: ignoring corrupt resume name
[ 5743.553028] CIFS VFS: ignoring corrupt resume name
[ 5743.652823] CIFS VFS: ignoring corrupt resume name
[ 5744.822936] CIFS VFS: ignoring corrupt resume name
[ 5758.608685] CIFS VFS: ignoring corrupt resume name
[ 5770.010003] CIFS VFS: ignoring corrupt resume name
[ 5792.937939] CIFS VFS: Send error in read = -512
[ 5792.938948] CIFS VFS: No task to wake, unknown frame received! NumMids 2
[ 5792.938958] Received Data is: : dump of 37 bytes of data at 0xf4f4b6c0
[ 5792.938974]  60000000 424d53ff 0000a4a4 c0018000 . . . ` \xffffffff S M B ¤ ¤ . . . . . Ŕ
[ 5792.938988]  00000000 00000000 00000000 2e130006 . . . . . . . . . . . . . . . .
[ 5792.938996]  67950002 00000012 . . . g .

The "CIFS VFS: ignoring corrupt resume name" message in particular appears 
very often, but it does not cause any major problems.
Then, but not always, the process performing the data transfer hangs and I 
get the following errors:

[ 6120.569517] INFO: task kio_file:12029 blocked for more than 120 seconds.
[ 6120.569521] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6120.569525] kio_file        D e417bc30     0 12029   6037 0x00000004
[ 6120.569533]  e417bcb4 00000086 e417bc30 e417bc30 e417bc38 31a0b404 00000559 00000000
[ 6120.569543]  c0724a80 e417bc58 c0724a80 f6707a80 f60ab180 f0c539c0 00000020 00000000
[ 6120.569552]  e427e3c0 e5602938 00000020 e560293c 000003b7 00000010 c0664940 e417bcb4
[ 6120.569561] Call Trace:
[ 6120.569575]  [<c0174f6c>] ? ktime_get_ts+0xdc/0x110
[ 6120.569583]  [<c04eadc0>] schedule+0x30/0x50
[ 6120.569588]  [<c04eae53>] io_schedule+0x73/0xb0
[ 6120.569594]  [<c01d93c8>] sleep_on_page+0x8/0x10
[ 6120.569599]  [<c04eb4d7>] __wait_on_bit_lock+0x47/0x90
[ 6120.569604]  [<c01d93c0>] ? __lock_page+0x80/0x80
[ 6120.569609]  [<c01d93b6>] __lock_page+0x76/0x80
[ 6120.569616]  [<c016c2e0>] ? autoremove_wake_function+0x40/0x40
[ 6120.569623]  [<c024ad6d>] __generic_file_splice_read+0x52d/0x550
[ 6120.569630]  [<c03f535c>] ? sock_alloc_send_pskb+0x15c/0x290
[ 6120.569636]  [<c03f94bb>] ? __alloc_skb+0x5b/0x210
[ 6120.569640]  [<c03f535c>] ? sock_alloc_send_pskb+0x15c/0x290
[ 6120.569647]  [<c03018fd>] ? _copy_from_user+0x3d/0x60
[ 6120.569652]  [<c03f8f27>] ? skb_queue_tail+0x37/0x50
[ 6120.569659]  [<c0484150>] ? unix_stream_sendmsg+0x3d0/0x420
[ 6120.569665]  [<c0249600>] ? page_cache_pipe_buf_release+0x20/0x20
[ 6120.569671]  [<c024ae24>] generic_file_splice_read+0x94/0x100
[ 6120.569677]  [<c024ad90>] ? __generic_file_splice_read+0x550/0x550
[ 6120.569682]  [<c02498f0>] do_splice_to+0x60/0x80
[ 6120.569687]  [<c0249b2e>] splice_direct_to_actor+0xae/0x1d0
[ 6120.569692]  [<c0249860>] ? do_splice_from+0x80/0x80
[ 6120.569698]  [<c024afcd>] do_splice_direct+0x4d/0x70
[ 6120.569705]  [<c02252e1>] do_sendfile+0x181/0x220
[ 6120.569710]  [<c0226053>] sys_sendfile64+0x53/0xc0
[ 6120.569716]  [<c04f391f>] sysenter_do_call+0x12/0x28

I'm unable to kill this process and it prevents the share from being 
unmounted:

$ ps ax | grep kio_file
12029 ?        D      0:00 kdeinit4: kio_file [kdeinit] file local:/home/users/ed/tmp/ksocket-ed/klauncherTi6038.slave-socket local:/home/users/ed/tmp/ksocket-ed/dolphinU11997.slave-socket
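
To see every task stuck in uninterruptible sleep (state D), together with 
the kernel function it is waiting in, something like the following can 
help. This is only a sketch: the helper name list_d_state is made up for 
illustration, and the ps format options assume the procps version of ps.

```shell
#!/bin/sh
# Keep the header line plus every row whose STAT column starts with "D"
# (uninterruptible sleep). list_d_state is a hypothetical helper name.
list_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# pid, state, kernel wait channel (up to 32 characters) and command line.
ps -eo pid,stat,wchan:32,args | list_d_state
```

A task in state D ignores signals, including SIGKILL, until the kernel 
call it is blocked in returns, which matches the behaviour seen here.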

So far I've found the following workaround. First, I can unmount the share 
with the -l (lazy) option, but the process in question still exists. 
Second, I can turn the NAS off, wait a moment, turn it on again (I'm not 
100% sure that restarting the NAS is required, but it works) and reload the 
cifs.ko module. As a result, the process is gone and I can keep on working. 
Until the problem occurs again...
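
The workaround above could be scripted roughly as follows. This is only a 
sketch under assumptions: the mount point /mnt/nas and the RUN variable are 
made up for illustration (the real mount point is not given here), and the 
NAS power cycle remains a manual step. By default the script only prints 
the commands; run it as root with RUN set to an empty string to actually 
execute them.

```shell
#!/bin/sh
# Assumption: hypothetical mount point, replace it with the real one.
MOUNTPOINT="${MOUNTPOINT:-/mnt/nas}"
# Dry run by default: commands are only printed. RUN="" executes them.
RUN="${RUN-echo}"

recover_cifs() {
    # 1. Lazy unmount: detach the share even though it is still busy.
    $RUN umount -l "$MOUNTPOINT"
    # 2. Power-cycle the NAS by hand here (may not be strictly required).
    # 3. Reload the module with the same parameters as before.
    $RUN rmmod cifs
    $RUN modprobe cifs echo_retries=1 cifs_max_pending=2
}

recover_cifs
```

rmmod may refuse to unload the module while the hung process still holds a 
reference to it, so the third step can fail until that process is gone.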

I'm using PLD Linux (which is probably not important). I have a vanilla 
kernel, currently 3.2.2, but the same has happened since 2.6.x (the only 
improvement after switching to 3.2 is a big performance jump). I have 
cifs-utils-5.2 installed and I load the cifs.ko module with the following 
parameters:

echo_retries=1 cifs_max_pending=2

cifs_max_pending=2 is the most important one: the higher the value, the 
more often the problem occurs, and 2 is the smallest value possible.

Is there anything I can do on the side of my Linux box in this situation? I 
cannot upgrade the NAS firmware, because I already have the latest version 
and probably no newer one will be released (it is closed-source). I cannot 
get rid of this NAS either, at least for some time. The best outcome would 
of course be to make cifs work with my NAS anyway, but that is up to you, 
as I don't have enough knowledge about it.
-- 
Łukasz Maśko                                                            _o)
Lukasz.Masko(at)ipipan.waw.pl                                           /\\
Registered Linux User #61028                                           _\_V