Hi,
crawling through all the /var/log/messages files, I found the following on
one of the failing nodes (node68):
Nov 25 04:04:12 node68 kernel: INFO: task pw.x:20052 blocked for more than 120 seconds.
Nov 25 04:04:12 node68 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 25 04:04:12 node68 kernel: pw.x D ffff81027c3d5d68 0 20052 1
Nov 25 04:04:12 node68 kernel: ffff81027c3d5d48 0000000000000086 ffff81021c0e7460 0000000000000000
Nov 25 04:04:12 node68 kernel: ffff81041f14e800 000000038022a7ae ffff81041f314238 ffff81041f314000
Nov 25 04:04:12 node68 kernel: 0000000000000000 0000000000000001 0000000000000246 0000000000000003
Nov 25 04:04:12 node68 kernel: Call Trace:
Nov 25 04:04:12 node68 kernel: [<ffffffff882ae9c7>] :fuse:request_send+0x2c8/0x2f0
Nov 25 04:04:12 node68 kernel: [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 25 04:04:12 node68 kernel: [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 25 04:04:12 node68 kernel: [<ffffffff882ae037>] :fuse:fuse_request_init+0x2f/0x38
Nov 25 04:04:12 node68 kernel: [<ffffffff882b1761>] :fuse:fuse_open_common+0xef/0x15e
Nov 25 04:04:12 node68 kernel: [<ffffffff882b188e>] :fuse:fuse_open+0x0/0x7
Nov 25 04:04:12 node68 kernel: [<ffffffff80286e30>] __dentry_open+0xe6/0x1ba
Nov 25 04:04:12 node68 kernel: [<ffffffff80286f2a>] nameidata_to_filp+0x26/0x35
Nov 25 04:04:12 node68 kernel: [<ffffffff80286f66>] do_filp_open+0x2d/0x3d
Nov 25 04:04:12 node68 kernel: [<ffffffff80287180>] get_unused_fd_flags+0x104/0x113
Nov 25 04:04:12 node68 kernel: [<ffffffff802872a3>] do_sys_open+0x46/0xc3
Nov 25 04:04:12 node68 kernel: [<ffffffff8020b08b>] system_call_after_swapgs+0x7b/0x80
Nov 25 04:04:12 node68 kernel:
Nov 25 04:04:12 node68 kernel: INFO: task pw.x:20053 blocked for more than 120 seconds.
Nov 25 04:04:12 node68 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 25 04:04:12 node68 kernel: pw.x D ffff8101c5083d68 0 20053 1
Nov 25 04:04:12 node68 kernel: ffff8101c5083d48 0000000000000086 ffff81021c0e7460 0000000000000000
Nov 25 04:04:12 node68 kernel: ffff81041f14a800 000000008022a7ae ffff81021d8b9238 ffff81021d8b9000
Nov 25 04:04:12 node68 kernel: 0000000000000000 0000000000000001 0000000000000246 0000000000000003
Nov 25 04:04:12 node68 kernel: Call Trace:
Nov 25 04:04:12 node68 kernel: [<ffffffff882ae9c7>] :fuse:request_send+0x2c8/0x2f0
Nov 25 04:04:12 node68 kernel: [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 25 04:04:12 node68 kernel: [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 25 04:04:12 node68 kernel: [<ffffffff882ae037>] :fuse:fuse_request_init+0x2f/0x38
Nov 25 04:04:12 node68 kernel: [<ffffffff882b1761>] :fuse:fuse_open_common+0xef/0x15e
Nov 25 04:04:12 node68 kernel: [<ffffffff882b188e>] :fuse:fuse_open+0x0/0x7
Nov 25 04:04:12 node68 kernel: [<ffffffff80286e30>] __dentry_open+0xe6/0x1ba
Nov 25 04:04:12 node68 kernel: [<ffffffff80286f2a>] nameidata_to_filp+0x26/0x35
Nov 25 04:04:12 node68 kernel: [<ffffffff80286f66>] do_filp_open+0x2d/0x3d
Nov 25 04:04:12 node68 kernel: [<ffffffff80287180>] get_unused_fd_flags+0x104/0x113
Nov 25 04:04:12 node68 kernel: [<ffffffff802872a3>] do_sys_open+0x46/0xc3
Nov 25 04:04:12 node68 kernel: [<ffffffff8020b08b>] system_call_after_swapgs+0x7b/0x80
Nov 25 04:04:12 node68 kernel:
The other two failing nodes had nothing related in the logs. Note that
pw.x:20052 and pw.x:20053 are the two parallel jobs running on this
node.
A similar error was logged during the crash two days ago on node22:
Nov 23 14:16:43 node22 kernel: INFO: task pw.x:32355 blocked for more than 120 seconds.
Nov 23 14:16:43 node22 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 23 14:16:43 node22 kernel: pw.x D ffff8102049c1d68 0 32355 1
Nov 23 14:16:43 node22 kernel: ffff8102049c1d48 0000000000000082 ffff81013e0e1c60 0000000000000000
Nov 23 14:16:43 node22 kernel: ffff81021e4ea000 000000038022a7ae ffff81021f004a38 ffff81021f004800
Nov 23 14:16:43 node22 kernel: 0000000000000000 0000000000000001 0000000000000246 0000000000000003
Nov 23 14:16:43 node22 kernel: Call Trace:
Nov 23 14:16:43 node22 kernel: [<ffffffff882ae9c7>] :fuse:request_send+0x2c8/0x2f0
Nov 23 14:16:43 node22 kernel: [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 23 14:16:43 node22 kernel: [<ffffffff80242ab3>] autoremove_wake_function+0x0/0x2e
Nov 23 14:16:43 node22 kernel: [<ffffffff882ae037>] :fuse:fuse_request_init+0x2f/0x38
Nov 23 14:16:43 node22 kernel: [<ffffffff882b1761>] :fuse:fuse_open_common+0xef/0x15e
Nov 23 14:16:43 node22 kernel: [<ffffffff882b188e>] :fuse:fuse_open+0x0/0x7
Nov 23 14:16:43 node22 kernel: [<ffffffff80286e30>] __dentry_open+0xe6/0x1ba
Nov 23 14:16:43 node22 kernel: [<ffffffff80286f2a>] nameidata_to_filp+0x26/0x35
Nov 23 14:16:43 node22 kernel: [<ffffffff80286f66>] do_filp_open+0x2d/0x3d
Nov 23 14:16:43 node22 kernel: [<ffffffff80287180>] get_unused_fd_flags+0x104/0x113
Nov 23 14:16:43 node22 kernel: [<ffffffff802872a3>] do_sys_open+0x46/0xc3
Nov 23 14:16:43 node22 kernel: [<ffffffff8020b08b>] system_call_after_swapgs+0x7b/0x80
Nov 23 14:16:43 node22 kernel:
That's all in /var/log/messages. Remember that the program "pw.x" runs
without problems via NFS, and that it is the only program used for
testing at present.
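
In case it helps with checking the remaining nodes, here is a minimal sketch
of how the hung-task entries could be pulled out of the messages files. It
assumes the lines are in the unwrapped syslog form (one entry per line, as
in the file itself rather than as wrapped by the mail client); the function
name find_hung_tasks and the regular expression are my own illustration,
not part of any existing tool.

```python
import re
from collections import defaultdict

# Matches the header line the kernel logs for a hung task, e.g.
# "Nov 25 04:04:12 node68 kernel: INFO: task pw.x:20052 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(lines):
    """Return {host: [(timestamp, command, pid, seconds), ...]}.

    `lines` is any iterable of syslog lines (a list or an open file).
    Lines that are not hung-task headers are ignored.
    """
    hits = defaultdict(list)
    for line in lines:
        m = HUNG_RE.match(line)
        if m:
            hits[m.group("host")].append(
                (m.group("stamp"), m.group("comm"),
                 int(m.group("pid")), int(m.group("secs")))
            )
    return dict(hits)
```

Passing an open /var/log/messages file from each node to find_hung_tasks
would list which tasks were reported as blocked and when; it only looks at
the header lines and does not parse the call traces.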
Fred
On 25.11.2008, at 13:42, Joe Landman wrote:
Fred Hucht wrote:
Hi!
The glusterfsd.log files on all nodes are virtually empty; the only entry
on 2008-11-25 reads
2008-11-25 03:13:48 E [io-threads.c:273:iot_flush] sc1-ioth: fd context is NULL, returning EBADFD
on all nodes. I don't think that this is related to our problems.
Regards,
Fred
Hi Fred
Could you post the complete /var/log/messages file on pastebin? I have
seen something like this before when fuse crashes. A fuse crash
could be due to a bug in fuse, the kernel, etc. It could also be
failing hardware.
Does an unmount/remount fix the problem?
Joe
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: landman@xxxxxxxxxxxxxxxxxxxxxxx
web : http://www.scalableinformatics.com
http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423 x121
fax : +1 866 888 3112
cell : +1 734 612 4615
Dr. Fred Hucht <fred@xxxxxxxxxxxxxx>
Institute for Theoretical Physics
University of Duisburg-Essen, 47048 Duisburg, Germany