Re: Desperate for Help - Cluster Node randomly reboots

Tan Ban Wee <tan.ban.wee@xxxxxxxxxxxx> · Thu, 27 Nov 2014 08:52:40 +0000

Hi,

Thanks for replying. Are there any logs I can provide to give us more leads on the cause?

Below is an excerpt from messages. Node 2 was evicted and rebooted at 22:37.

These random reboots happen almost every day.

Nov 26 22:37:01 HITGSMQ01 rgmanager[31311]: [script] Executing /etc/init.d/SGQM04 status
Nov 26 22:37:11 HITGSMQ01 rgmanager[31713]: [script] Executing /etc/init.d/SGQM03 status
Nov 26 22:37:31 HITGSMQ01 rgmanager[32270]: [script] Executing /etc/init.d/SGQM04 status
Nov 26 22:37:47 HITGSMQ01 qdiskd[2357]: Writing eviction notice for node 2
Nov 26 22:37:48 HITGSMQ01 qdiskd[2357]: Node 2 evicted
Nov 26 22:37:51 HITGSMQ01 rgmanager[433]: [script] Executing /etc/init.d/SGQM03 status
Nov 26 22:37:52 HITGSMQ01 corosync[2303]:   [TOTEM ] A processor failed, forming new configuration.
Nov 26 22:37:54 HITGSMQ01 corosync[2303]:   [QUORUM] Members[1]: 1
Nov 26 22:37:54 HITGSMQ01 corosync[2303]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 26 22:37:54 HITGSMQ01 kernel: dlm: closing connection to node 2
Nov 26 22:37:54 HITGSMQ01 rgmanager[3146]: State change: HITGSMQ02-hb DOWN
Nov 26 22:37:54 HITGSMQ01 corosync[2303]:   [CPG   ] chosen downlist: sender r(0) ip(10.1.3.3) ; members(old:2 left:1)
Nov 26 22:37:54 HITGSMQ01 corosync[2303]:   [MAIN  ] Completed service synchronization, ready to provide service.
Nov 26 22:39:57 HITGSMQ01 kernel: INFO: task rgmanager:469 blocked for more than 120 seconds.
Nov 26 22:39:57 HITGSMQ01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 22:39:57 HITGSMQ01 kernel: rgmanager     D 0000000000000000     0   469   3144 0x00000080
Nov 26 22:39:57 HITGSMQ01 kernel: ffff880c24111c60 0000000000000086 ffff880c24111c28 ffff880c24111c24
Nov 26 22:39:57 HITGSMQ01 kernel: ffffffff81055f76 ffff880c7fc23080 ffff880028316700 000000000000047e
Nov 26 22:39:57 HITGSMQ01 kernel: ffff880c692ad058 ffff880c24111fd8 000000000000fb88 ffff880c692ad058
Nov 26 22:39:57 HITGSMQ01 kernel: Call Trace:
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81055f76>] ? enqueue_task+0x66/0x80
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8150fc45>] rwsem_down_failed_common+0x95/0x1d0
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8150fdd6>] rwsem_down_read_failed+0x26/0x30
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff812833b4>] call_rwsem_down_read_failed+0x14/0x30
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8150f2d4>] ? down_read+0x24/0x30
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffffa038e257>] dlm_user_request+0x47/0x1b0 [dlm]
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8106659b>] ? dequeue_task_fair+0x12b/0x130
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81167c53>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffffa039b2a7>] device_write+0x5c7/0x720 [dlm]
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8109c424>] ? switch_task_namespaces+0x24/0x60
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81180f98>] vfs_write+0xb8/0x1a0
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81181891>] sys_write+0x51/0x90
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffffa03b09df>] twnotify_sys_write+0x1f/0x80 [twnotify]
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Nov 26 22:40:12 HITGSMQ01 corosync[2303]:   [TOTEM ] A processor joined or left the membership and a new membership was for
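
To make excerpts like the one above easier to compare across reboots, the qdiskd lines can be pulled out of the log with a short script. This is only a rough sketch: the function names (`qdiskd_events`, `evictions_per_node`) and the regular expressions are assumptions based on the line shapes seen in this thread, not part of any cluster tooling, and other qdiskd versions may word their messages differently.

```python
import re
from collections import Counter

# Assumed line shapes, taken from the excerpts in this thread:
#   Nov 26 22:37:48 HITGSMQ01 qdiskd[2357]: Node 2 evicted
#   > Nov 25 23:40:19 qdiskd Node 1 missed an update (5/4)
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"   # syslog timestamp
    r"(?:\S+\s+)?qdiskd(?:\[\d+\])?:?\s+"            # optional host, then qdiskd[pid]:
    r"(?P<msg>.*)$"
)
MISSED_RE = re.compile(r"Node (\d+) missed an update \((\d+)/(\d+)\)")
EVICTED_RE = re.compile(r"Node (\d+) evicted")


def qdiskd_events(lines):
    """Return (timestamp, node, kind, detail) tuples for qdiskd lines."""
    events = []
    for raw in lines:
        # Drop mail quoting ("> ") so quoted logs parse the same way.
        match = LINE_RE.match(raw.lstrip("> ").rstrip())
        if not match:
            continue
        msg = match.group("msg")
        missed = MISSED_RE.search(msg)
        evicted = EVICTED_RE.search(msg)
        if missed:
            node, count, tko = map(int, missed.groups())
            events.append((match.group("ts"), node, "missed", f"{count}/{tko}"))
        elif evicted:
            events.append((match.group("ts"), int(evicted.group(1)), "evicted", ""))
    return events


def evictions_per_node(events):
    """Count how many times each node was evicted."""
    return Counter(node for _, node, kind, _ in events if kind == "evicted")


# Sample lines copied from this thread.
SAMPLE = [
    "Nov 26 22:37:47 HITGSMQ01 qdiskd[2357]: Writing eviction notice for node 2",
    "Nov 26 22:37:48 HITGSMQ01 qdiskd[2357]: Node 2 evicted",
    "> Nov 25 23:40:19 qdiskd Node 1 missed an update (5/4)",
    "> Nov 25 23:40:20 qdiskd Node 1 evicted",
]

if __name__ == "__main__":
    for event in qdiskd_events(SAMPLE):
        print(event)
    print(evictions_per_node(qdiskd_events(SAMPLE)))
```

Running it over the full messages file from both nodes would show which node is evicted each time and whether missed updates always come first.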

-----Original Message-----
From: discuss-bounces@xxxxxxxxxxxx [mailto:discuss-bounces@xxxxxxxxxxxx] On Behalf Of Christine Caulfield
Sent: Thursday, November 27, 2014 4:40 PM
To: discuss@xxxxxxxxxxxx
Subject: Re:  Desperate for Help - Cluster Node randomly reboots

On 26/11/14 09:24, Tan Ban Wee wrote:
> Hi,
>
> This is a 2-node cluster and the nodes are randomly rebooting themselves.
> I hope someone can help me narrow down the cause.
>
> Nov 25 23:40:17 qdiskd Node 1 missed an update (3/4)
> Nov 25 23:40:18 qdiskd Node 1 missed an update (4/4)
> Nov 25 23:40:19 qdiskd Node 1 missed an update (5/4)
> Nov 25 23:40:19 qdiskd Node 1 DOWN
> Nov 25 23:40:19 qdiskd Writing eviction notice for node 1
> Nov 25 23:40:19 qdiskd Telling CMAN to kill the node
> Nov 25 23:40:20 qdiskd Node 1 evicted
> Nov 25 23:44:19 qdiskd Node 1 is UP
> Nov 25 23:44:20 qdiskd Node 1 shutdown
> Nov 25 23:44:26 qdiskd Node 1 is UP
> Nov 25 23:44:37 qdiskd Node 1 missed an update (2/4)
> Nov 25 23:44:38 qdiskd Node 1 missed an update (3/4)
> Nov 25 23:44:39 qdiskd Node 1 missed an update (4/4)
> Nov 25 23:44:40 qdiskd Node 1 missed an update (5/4)
> Nov 25 23:44:40 qdiskd Node 1 DOWN
> Nov 25 23:44:40 qdiskd Writing eviction notice for node 1
> Nov 25 23:44:40 qdiskd Telling CMAN to kill the node
> Nov 25 23:44:41 qdiskd Node 1 evicted
> Nov 25 23:50:48 qdiskd Loading dynamic configuration
> Nov 25 23:50:49 qdiskd Setting autocalculated votes to 1
> Nov 25 23:50:49 qdiskd Loading static configuration
> Nov 25 23:50:49 qdiskd Auto-configured TKO as 4 based on token=10000 interval=1
> Nov 25 23:50:49 qdiskd Timings: 4 tko, 1 interval
> Nov 25 23:50:49 qdiskd Timings: 2 tko_up, 3 master_wait, 2 upgrade_wait
> Nov 25 23:50:49 qdiskd Heuristic: 'ping -c3 -w5 10.101.210.250' score=1 interval=3 tko=5
> Nov 25 23:50:49 qdiskd 1 heuristics loaded
> Nov 25 23:50:49 qdiskd Quorum Daemon: 1 heuristics, 1 interval, 4 tko, 1 votes
> Nov 25 23:50:49 qdiskd Run Flags: 00000231
> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad

From that limited information I would guess that your quorum disk
partition is either offline or corrupted. First check that the drive is
online. If it seems OK physically, then check that it hasn't been
formatted as a filesystem or something else by mistake, and rebuild the
header using mkqdisk.

Chrissie
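
As a rough first look at the "formatted as a filesystem by mistake" possibility, the start of the quorum device can be read directly. The sketch below is an illustration only: `inspect_device_start` is a made-up helper, the device path in the usage note is hypothetical, and the script knows nothing about the qdisk header layout. It only reports whether the first 512-byte sector is all zeros and whether the ext2/3/4 superblock magic (0xEF53 at byte 1080) is present. Other filesystems would need other checks.

```python
SECTOR_SIZE = 512              # same length as "len = 512" in the qdiskd log lines
EXT_MAGIC_OFFSET = 1024 + 56   # ext superblock starts at byte 1024; s_magic is at +56
EXT_MAGIC = b"\x53\xef"        # 0xEF53 stored little-endian


def inspect_device_start(path):
    """Read the first bytes of a device or file and report simple findings."""
    with open(path, "rb") as dev:
        head = dev.read(EXT_MAGIC_OFFSET + len(EXT_MAGIC))
    first_sector = head[:SECTOR_SIZE]
    return {
        "bytes_read": len(head),
        # True only if a full sector was read and every byte in it is zero.
        "first_sector_blank": len(first_sector) == SECTOR_SIZE
        and not any(first_sector),
        # True if the ext2/3/4 magic number sits where a superblock would put it.
        "looks_like_ext_fs": head[EXT_MAGIC_OFFSET:EXT_MAGIC_OFFSET + 2] == EXT_MAGIC,
    }
```

It would be called as, for example, `inspect_device_start("/dev/your-qdisk-device")` (path hypothetical, read access to the device required). The script only reads from the device, so it does not change the quorum disk. Rebuilding the header is still a job for mkqdisk.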

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss