Re: Desperate for Help - Cluster Node randomly reboots

Christine Caulfield <ccaulfie@xxxxxxxxxx> · Thu, 27 Nov 2014 09:14:17 +0000

The first thing to check is your qdisk, as qdiskd is the daemon that's
causing the reboots. It's complaining about the CRC while reading the
header, which leads me to think that the drive/partition is either
offline or corrupted. As the value read back is consistently
"0x190a55ad", it seems most likely to me that the partition is corrupt:
perhaps formatted as a filesystem by mistake, or something else has gone
badly wrong.
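
If it helps to confirm the "consistent contents" point across a longer log, here is a rough Python sketch that pulls every header CRC mismatch out of a chunk of log text and collects the distinct "Got" values. The function names and the sample text are only illustrative; the regex is written against the message format shown in this thread and allows for the value wrapping onto the next line, as it does in the mail.

```python
import re

# Matches the qdiskd message as it appears in this thread; \s* lets the
# "Got:" value sit on the following line when the mail client wrapped it.
CRC_RE = re.compile(
    r"Header CRC32 mismatch; Exp: (0x[0-9a-fA-F]+) Got:\s*(0x[0-9a-fA-F]+)"
)


def crc_mismatches(log_text):
    """Return a list of (expected, got) pairs, one per mismatch message."""
    return CRC_RE.findall(log_text)


def got_values(log_text):
    """Return the set of distinct CRC values that were actually read."""
    return {got for _expected, got in crc_mismatches(log_text)}


# Sample lines copied from the thread (one of them wrapped).
sample = """\
Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got:
0x190a55ad
"""

pairs = crc_mismatches(sample)
distinct = got_values(sample)
print(len(pairs), "mismatches, distinct values read:", sorted(distinct))
```

A single distinct value suggests the same bytes are being read every time, which points more towards wrong data sitting on the partition than towards an intermittent read problem. Several different values would be worth mentioning on the list.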

So, check that the partition is OK and not mounted. 'file -s <device>'
will tell you if it's been formatted as something else; for a qdisk it'll
just come back with 'data', as it's mostly zeroes, which is also
something to check.
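
The two checks above (not mounted, mostly zeroes) can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the helper names are made up, the mount check is a plain string match against /proc/mounts-style text (so it will miss a device that is mounted under another name such as a device-mapper alias), and "mostly zeroes" is left as a fraction for you to judge rather than a fixed threshold.

```python
def is_mounted(device, mounts_text):
    """True if `device` is the source field of any line in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == device:
            return True
    return False


def zero_fraction(data):
    """Fraction of the bytes in `data` that are zero (0.0 for empty input)."""
    if not data:
        return 0.0
    return data.count(0) / len(data)


# In-memory examples; the device names here are hypothetical.
mounts = "/dev/sda1 / ext4 rw 0 0\n/dev/sdb1 /data ext4 rw 0 0\n"
print(is_mounted("/dev/sdb1", mounts))   # a device that is listed
print(is_mounted("/dev/sdc1", mounts))   # a device that is not listed
print(zero_fraction(bytes(512)))         # an all-zero 512-byte block

# On a real node (as root) you would feed in the live data instead, e.g.:
#   with open("/proc/mounts") as f:
#       mounted = is_mounted("/dev/your-qdisk-partition", f.read())
#   with open("/dev/your-qdisk-partition", "rb") as f:
#       frac = zero_fraction(f.read(1024 * 1024))
```

Both functions only read, so they are safe to run against the quorum partition while you are investigating.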

Recreating the qdisk with mkqdisk -c <device> -l <label> might resurrect
it in this case, and is well worth trying. But make sure it's not a
mounted filesystem first, as it'll just get corrupted again!

Chrissie

On 27/11/14 08:52, Tan Ban Wee wrote:
Hi,

Thanks for replying. Are there any logs that I can provide so that we can have more leads to the cause?

The lines below are from messages. Node 2 was evicted and rebooted at 22:37.

These random reboots happen almost every day.

Nov 26 22:37:01 HITGSMQ01 rgmanager[31311]: [script] Executing /etc/init.d/SGQM04 status
Nov 26 22:37:11 HITGSMQ01 rgmanager[31713]: [script] Executing /etc/init.d/SGQM03 status
Nov 26 22:37:31 HITGSMQ01 rgmanager[32270]: [script] Executing /etc/init.d/SGQM04 status
Nov 26 22:37:47 HITGSMQ01 qdiskd[2357]: Writing eviction notice for node 2
Nov 26 22:37:48 HITGSMQ01 qdiskd[2357]: Node 2 evicted
Nov 26 22:37:51 HITGSMQ01 rgmanager[433]: [script] Executing /etc/init.d/SGQM03 status
Nov 26 22:37:52 HITGSMQ01 corosync[2303]:   [TOTEM ] A processor failed, forming new configuration.
Nov 26 22:37:54 HITGSMQ01 corosync[2303]:   [QUORUM] Members[1]: 1
Nov 26 22:37:54 HITGSMQ01 corosync[2303]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 26 22:37:54 HITGSMQ01 kernel: dlm: closing connection to node 2
Nov 26 22:37:54 HITGSMQ01 rgmanager[3146]: State change: HITGSMQ02-hb DOWN
Nov 26 22:37:54 HITGSMQ01 corosync[2303]:   [CPG   ] chosen downlist: sender r(0) ip(10.1.3.3) ; members(old:2 left:1)
Nov 26 22:37:54 HITGSMQ01 corosync[2303]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 26 22:39:57 HITGSMQ01 kernel: INFO: task rgmanager:469 blocked for more than 120 seconds.
Nov 26 22:39:57 HITGSMQ01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 22:39:57 HITGSMQ01 kernel: rgmanager     D 0000000000000000     0   469   3144 0x00000080
Nov 26 22:39:57 HITGSMQ01 kernel: ffff880c24111c60 0000000000000086 ffff880c24111c28 ffff880c24111c24
Nov 26 22:39:57 HITGSMQ01 kernel: ffffffff81055f76 ffff880c7fc23080 ffff880028316700 000000000000047e
Nov 26 22:39:57 HITGSMQ01 kernel: ffff880c692ad058 ffff880c24111fd8 000000000000fb88 ffff880c692ad058
Nov 26 22:39:57 HITGSMQ01 kernel: Call Trace:
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81055f76>] ? enqueue_task+0x66/0x80
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8150fc45>] rwsem_down_failed_common+0x95/0x1d0
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8150fdd6>] rwsem_down_read_failed+0x26/0x30
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff812833b4>] call_rwsem_down_read_failed+0x14/0x30
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8150f2d4>] ? down_read+0x24/0x30
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffffa038e257>] dlm_user_request+0x47/0x1b0 [dlm]
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8106659b>] ? dequeue_task_fair+0x12b/0x130
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81167c53>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffffa039b2a7>] device_write+0x5c7/0x720 [dlm]
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8109c424>] ? switch_task_namespaces+0x24/0x60
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81180f98>] vfs_write+0xb8/0x1a0
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81181891>] sys_write+0x51/0x90
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffffa03b09df>] twnotify_sys_write+0x1f/0x80 [twnotify]
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Nov 26 22:40:12 HITGSMQ01 corosync[2303]:   [TOTEM ] A processor joined or left the membership and a new membership was for

-----Original Message-----
From: discuss-bounces@xxxxxxxxxxxx [mailto:discuss-bounces@xxxxxxxxxxxx] On Behalf Of Christine Caulfield
Sent: Thursday, November 27, 2014 4:40 PM
To: discuss@xxxxxxxxxxxx
Subject: Re:  Desperate for Help - Cluster Node randomly reboots

On 26/11/14 09:24, Tan Ban Wee wrote:
Hi,

This is a two-node cluster and the nodes are randomly rebooting. I hope
someone can help me narrow down the cause.

Nov 25 23:40:17 qdiskd Node 1 missed an update (3/4)
Nov 25 23:40:18 qdiskd Node 1 missed an update (4/4)
Nov 25 23:40:19 qdiskd Node 1 missed an update (5/4)
Nov 25 23:40:19 qdiskd Node 1 DOWN
Nov 25 23:40:19 qdiskd Writing eviction notice for node 1
Nov 25 23:40:19 qdiskd Telling CMAN to kill the node
Nov 25 23:40:20 qdiskd Node 1 evicted
Nov 25 23:44:19 qdiskd Node 1 is UP
Nov 25 23:44:20 qdiskd Node 1 shutdown
Nov 25 23:44:26 qdiskd Node 1 is UP
Nov 25 23:44:37 qdiskd Node 1 missed an update (2/4)
Nov 25 23:44:38 qdiskd Node 1 missed an update (3/4)
Nov 25 23:44:39 qdiskd Node 1 missed an update (4/4)
Nov 25 23:44:40 qdiskd Node 1 missed an update (5/4)
Nov 25 23:44:40 qdiskd Node 1 DOWN
Nov 25 23:44:40 qdiskd Writing eviction notice for node 1
Nov 25 23:44:40 qdiskd Telling CMAN to kill the node
Nov 25 23:44:41 qdiskd Node 1 evicted
Nov 25 23:50:48 qdiskd Loading dynamic configuration
Nov 25 23:50:49 qdiskd Setting autocalculated votes to 1
Nov 25 23:50:49 qdiskd Loading static configuration
Nov 25 23:50:49 qdiskd Auto-configured TKO as 4 based on token=10000 interval=1
Nov 25 23:50:49 qdiskd Timings: 4 tko, 1 interval
Nov 25 23:50:49 qdiskd Timings: 2 tko_up, 3 master_wait, 2 upgrade_wait
Nov 25 23:50:49 qdiskd Heuristic: 'ping -c3 -w5 10.101.210.250' score=1 interval=3 tko=5
Nov 25 23:50:49 qdiskd 1 heuristics loaded
Nov 25 23:50:49 qdiskd Quorum Daemon: 1 heuristics, 1 interval, 4 tko, 1 votes
Nov 25 23:50:49 qdiskd Run Flags: 00000231
Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad

From that limited information I would guess that your quorum disk
partition is either offline or corrupted. First check that the drive is
online; if it seems OK physically, then check that it hasn't been
formatted as a filesystem or something else by mistake, and rebuild the
header using mkqdisk.

Chrissie

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss