Hanging writes after upgrading "clients" to debian squeeze

sbecker at rapidsoft.de (Stefan Becker) · Sun, 5 Feb 2012 23:03:16 +0100

Hi Brian,

OK, tcpdump and strace are both producing a flood of output. The client is doing a lot, and it is connected to both .40 and .41. During my tests I found that the hangs are random. There is one touch I have been trying the whole day, and it is still hanging. Other touches (other directories/filenames) succeed or hang at random. Here is what I found using dmesg:

[17514.155548] INFO: task touch:25873 blocked for more than 120 seconds.
[17514.155583] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[17514.155627] touch         D 0000000000000000     0 25873  20727 0x00000004
[17514.155630]  ffff88023f0e2a60 0000000000000086 0000000000000000 ffffffff81049fd0
[17514.155633]  ffff8801dc7c3c28 000000000000f9e0 ffff8801dc7c3fd8 0000000000015780
[17514.155635]  0000000000015780 ffff88023bc8e2e0 ffff88023bc8e5d8 000000058103a866
[17514.155638] Call Trace:
[17514.155644]  [<ffffffff81049fd0>] ? try_to_wake_up+0x249/0x259
[17514.155646]  [<ffffffff8103f80c>] ? __wake_up+0x30/0x44
[17514.155651]  [<ffffffffa0181ab1>] ? fuse_request_send+0x1a2/0x255 [fuse]
[17514.155654]  [<ffffffff810649da>] ? autoremove_wake_function+0x0/0x2e
[17514.155657]  [<ffffffffa0182992>] ? fuse_request_alloc+0x22/0x27 [fuse]
[17514.155660]  [<ffffffffa0187da2>] ? fuse_file_alloc+0xc4/0xeb [fuse]
[17514.155663]  [<ffffffffa0184646>] ? fuse_create+0x1ce/0x38f [fuse]
[17514.155667]  [<ffffffff810f7180>] ? vfs_create+0x6d/0x89
[17514.155669]  [<ffffffff810f80a9>] ? do_filp_open+0x31e/0x94b
[17514.155673]  [<ffffffff810cc2d5>] ? handle_mm_fault+0x3b8/0x80f
[17514.155676]  [<ffffffff810ec8af>] ? do_sys_open+0x55/0xfc
[17514.155678]  [<ffffffff81010b42>] ? system_call_fastpath+0x16/0x1b

That does not look too good. It reminds me of a kernel/gluster bug I had a year ago, which could only be fixed by turning off "quickread" (it is still off, I checked).

Any other ideas?

-
Stefan

-----Original Message-----
From: gluster-users-bounces at gluster.org [mailto:gluster-users-bounces at gluster.org] On Behalf Of Stefan Becker
Sent: Sunday, 5 February 2012 22:30
To: Brian Candler
Cc: gluster-users at gluster.org
Subject: Re: Hanging writes after upgrading "clients" to debian squeeze

Hi Brian,

Thanks for your help. I will try what you suggested and come back with results or the solution :)

Greets,
Stefan

-----Original Message-----
From: Brian Candler [mailto:B.Candler at pobox.com]
Sent: Sunday, 5 February 2012 22:28
To: Stefan Becker
Cc: Whit Blauvelt; gluster-users at gluster.org
Subject: Re: Hanging writes after upgrading "clients" to debian squeeze

On Sun, Feb 05, 2012 at 09:49:47PM +0100, Stefan Becker wrote:
> - no ip tables involved

OK. So how about this on the client:

tcpdump -i eth0 -nn host 10.10.100.40 or host 10.10.100.41

(replace eth0 as necessary)

That will show you traffic to and from the bricks. When you issue a write
(e.g. touch /path/to/foo), does traffic only go out to one brick? Do you
see any TCP retransmissions? Does 'netstat -nt' show TCP connections to both
bricks? Does Send-Q stay at zero most of the time, or is it stuck at a
non-zero value?
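
If watching netstat by hand gets tedious, the Send-Q check could be scripted roughly as below. This is only a sketch: it assumes the usual Linux 'netstat -nt' column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State) and the two brick addresses from the tcpdump line above. The function name and the file argument are made up for illustration.

```python
import sys

# Brick addresses taken from the tcpdump command above.
BRICKS = {"10.10.100.40", "10.10.100.41"}


def brick_connections(netstat_output, bricks=BRICKS):
    """Return (brick_ip, send_q, state) for each TCP connection to a brick.

    Assumes the usual Linux `netstat -nt` columns:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    found = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the header lines and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        remote_ip = fields[4].rsplit(":", 1)[0]  # strip the port
        if remote_ip in bricks:
            found.append((remote_ip, int(fields[2]), fields[5]))
    return found


if __name__ == "__main__" and len(sys.argv) > 1:
    # Expects a file holding saved `netstat -nt` output.
    with open(sys.argv[1]) as fh:
        conns = brick_connections(fh.read())
    for ip, send_q, state in conns:
        note = "  <-- Send-Q not zero" if send_q > 0 else ""
        print(f"{ip}  Send-Q={send_q}  {state}{note}")
    missing = BRICKS - {ip for ip, _, _ in conns}
    if missing:
        print("no TCP connection to:", ", ".join(sorted(missing)))
```

Save the output with 'netstat -nt > ns.txt' and pass ns.txt as the only argument. Running it a few times while a touch is hanging should show whether Send-Q to one brick stays stuck.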

You could also try:
  strace -p <pid-of-glusterfs-process>
on the client as well. You should see writev(fd,...) and readv(fd,...) with
different fds for communication to each of the bricks. Then try issuing
a single write.

The strace output may not tell you much by itself, but if you compare what
you see on a non-upgraded (working) client versus an upgraded (broken)
client, you might be able to see what it's getting stuck on.
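
To make that comparison less of an eyeball job, you could save each trace with 'strace -o <file> -p <pid>' and count the calls per file descriptor. The following is a rough sketch rather than a finished tool: the function names and the two file arguments are made up for illustration, and fd numbers will not necessarily match between two machines, so treat the per-fd differences as a hint about where to look.

```python
import re
import sys
from collections import Counter

# Matches lines such as "writev(12, [...], 3) = 140", optionally prefixed
# with a pid (as strace prints with -f). Only calls whose first argument is
# a plain number (normally an fd) are counted.
CALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\((\d+)[,)]")


def count_calls(strace_text):
    """Count occurrences of each (syscall, first numeric argument) pair."""
    counts = Counter()
    for line in strace_text.splitlines():
        m = CALL_RE.match(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts


def diff_counts(working, broken):
    """Return (call, fd, n_working, n_broken) rows where the counts differ."""
    rows = []
    for key in sorted(set(working) | set(broken)):
        if working[key] != broken[key]:
            rows.append((key[0], key[1], working[key], broken[key]))
    return rows


if __name__ == "__main__" and len(sys.argv) == 3:
    # Arguments: trace from the working client, trace from the broken client.
    with open(sys.argv[1]) as fh:
        working = count_calls(fh.read())
    with open(sys.argv[2]) as fh:
        broken = count_calls(fh.read())
    for call, fd, n_ok, n_bad in diff_counts(working, broken):
        print(f"{call}(fd={fd}): working={n_ok} broken={n_bad}")
```

If the broken client shows writev() going out on only one of the brick fds, or a readv() that never appears, that narrows down where it is getting stuck.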

Regards,

Brian.