Re: Desperate for Help - Cluster Node randomly reboots

Tan Ban Wee <tan.ban.wee@xxxxxxxxxxxx> · Thu, 27 Nov 2014 09:50:45 +0000

Thanks.

How can I relate the qdisk label below to the actual disk that the OS sees?

<quorumd label="qdisk0">
                <heuristic interval="3" program="ping -c3 -w5 10.101.210.250" tko="5"/>
</quorumd>
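
One possible starting point, sketched under assumptions: mkqdisk has a -f <label> option that looks up a quorum disk by its label and prints information about it (including the device), and -L lists every quorum disk it can see. The Python below only pulls the label out of the quorumd fragment quoted above and builds that lookup command. It does not run anything, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The quorumd fragment from cluster.conf, as quoted in this thread.
QUORUMD_XML = """
<quorumd label="qdisk0">
    <heuristic interval="3" program="ping -c3 -w5 10.101.210.250" tko="5"/>
</quorumd>
"""


def qdisk_label(quorumd_xml):
    """Return the label attribute of a <quorumd> element (None if absent)."""
    root = ET.fromstring(quorumd_xml)
    return root.get("label")


def lookup_command(label):
    """Build (but do not run) the mkqdisk lookup for a given label."""
    return ["mkqdisk", "-f", label]


label = qdisk_label(QUORUMD_XML)
print(" ".join(lookup_command(label)))
```

Running the printed command as root on a cluster node should show which device carries that label; "mkqdisk -L" is the broader variant if the label lookup finds nothing.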

-----Original Message-----
From: discuss-bounces@xxxxxxxxxxxx [mailto:discuss-bounces@xxxxxxxxxxxx] On Behalf Of Christine Caulfield
Sent: Thursday, November 27, 2014 5:14 PM
To: discuss@xxxxxxxxxxxx
Subject: Re:  Desperate for Help - Cluster Node randomly reboots

The first thing is to check your qdisk, as qdiskd is the daemon that's
causing the reboots. It's complaining about the CRC while reading the
header, which leads me to think that the drive/partition is either
offline or corrupted. As the value read back is consistently "0x190a55ad",
it seems most likely to me that the partition is corrupt - perhaps
formatted as a filesystem by mistake, or something else has gone badly wrong.

So, check the partition is OK and not mounted. 'file -s' will tell you
if it's been formatted as something else; for a qdisk it'll just come
back with 'data' as it's mostly zeroes - which is also something to check.
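
The two checks above (not mounted, mostly zeroes) can be sketched in Python. This is a minimal illustration, not a cluster tool: the function names are hypothetical, the mount check assumes the usual /proc/mounts layout where the first field of each line is the device, and it only does an exact string match on the device path.

```python
def is_mounted(device, mounts_text):
    """Return True if `device` is the first field of any /proc/mounts line."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == device:
            return True
    return False


def zero_fraction(block):
    """Fraction of bytes in `block` that are 0x00 (0.0 for an empty block)."""
    if not block:
        return 0.0
    return block.count(0) / len(block)
```

In practice you would pass the contents of /proc/mounts as mounts_text and a chunk read from the qdisk partition (opened read-only in binary mode) as block. A mount that refers to the same disk under another name, such as a device-mapper path, would not be caught by the exact match, so a False result is a hint rather than proof.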

Recreating the qdisk with mkqdisk -c <device> -l <label> might resurrect
it in this case, and is well worth trying, but make sure it's not a
mounted filesystem first as it'll just get corrupted again!

Chrissie

On 27/11/14 08:52, Tan Ban Wee wrote:
> Hi,
>
> Thanks for replying. Are there any logs that I can provide to give us more leads on the cause?
>
> The lines below are from messages. Node 2 was evicted and rebooted at 22:37.
>
> These random reboots happen almost every day.
>
> Nov 26 22:37:01 HITGSMQ01 rgmanager[31311]: [script] Executing /etc/init.d/SGQM04 status
> Nov 26 22:37:11 HITGSMQ01 rgmanager[31713]: [script] Executing /etc/init.d/SGQM03 status
> Nov 26 22:37:31 HITGSMQ01 rgmanager[32270]: [script] Executing /etc/init.d/SGQM04 status
> Nov 26 22:37:47 HITGSMQ01 qdiskd[2357]: Writing eviction notice for node 2
> Nov 26 22:37:48 HITGSMQ01 qdiskd[2357]: Node 2 evicted
> Nov 26 22:37:51 HITGSMQ01 rgmanager[433]: [script] Executing /etc/init.d/SGQM03 status
> Nov 26 22:37:52 HITGSMQ01 corosync[2303]:   [TOTEM ] A processor failed, forming new configuration.
> Nov 26 22:37:54 HITGSMQ01 corosync[2303]:   [QUORUM] Members[1]: 1
> Nov 26 22:37:54 HITGSMQ01 corosync[2303]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Nov 26 22:37:54 HITGSMQ01 kernel: dlm: closing connection to node 2
> Nov 26 22:37:54 HITGSMQ01 rgmanager[3146]: State change: HITGSMQ02-hb DOWN
> Nov 26 22:37:54 HITGSMQ01 corosync[2303]:   [CPG   ] chosen downlist: sender r(0) ip(10.1.3.3) ; members(old:2 left:1)
> Nov 26 22:37:54 HITGSMQ01 corosync[2303]:   [MAIN  ] Completed service synchronization, ready to provide service.
> Nov 26 22:39:57 HITGSMQ01 kernel: INFO: task rgmanager:469 blocked for more than 120 seconds.
> Nov 26 22:39:57 HITGSMQ01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Nov 26 22:39:57 HITGSMQ01 kernel: rgmanager     D 0000000000000000     0   469   3144 0x00000080
> Nov 26 22:39:57 HITGSMQ01 kernel: ffff880c24111c60 0000000000000086 ffff880c24111c28 ffff880c24111c24
> Nov 26 22:39:57 HITGSMQ01 kernel: ffffffff81055f76 ffff880c7fc23080 ffff880028316700 000000000000047e
> Nov 26 22:39:57 HITGSMQ01 kernel: ffff880c692ad058 ffff880c24111fd8 000000000000fb88 ffff880c692ad058
> Nov 26 22:39:57 HITGSMQ01 kernel: Call Trace:
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81055f76>] ? enqueue_task+0x66/0x80
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8150fc45>] rwsem_down_failed_common+0x95/0x1d0
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8150fdd6>] rwsem_down_read_failed+0x26/0x30
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff812833b4>] call_rwsem_down_read_failed+0x14/0x30
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8150f2d4>] ? down_read+0x24/0x30
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffffa038e257>] dlm_user_request+0x47/0x1b0 [dlm]
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8106659b>] ? dequeue_task_fair+0x12b/0x130
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81167c53>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffffa039b2a7>] device_write+0x5c7/0x720 [dlm]
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8109c424>] ? switch_task_namespaces+0x24/0x60
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81180f98>] vfs_write+0xb8/0x1a0
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81181891>] sys_write+0x51/0x90
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffffa03b09df>] twnotify_sys_write+0x1f/0x80 [twnotify]
> Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
> Nov 26 22:40:12 HITGSMQ01 corosync[2303]:   [TOTEM ] A processor joined or left the membership and a new membership was for
>
> -----Original Message-----
> From: discuss-bounces@xxxxxxxxxxxx [mailto:discuss-bounces@xxxxxxxxxxxx] On Behalf Of Christine Caulfield
> Sent: Thursday, November 27, 2014 4:40 PM
> To: discuss@xxxxxxxxxxxx
> Subject: Re:  Desperate for Help - Cluster Node randomly reboots
>
> On 26/11/14 09:24, Tan Ban Wee wrote:
>> Hi,
>>
>> This is a 2-node cluster and the nodes are randomly rebooting. I hope
>> someone can help me narrow down the cause.
>>
>> Nov 25 23:40:17 qdiskd Node 1 missed an update (3/4)
>> Nov 25 23:40:18 qdiskd Node 1 missed an update (4/4)
>> Nov 25 23:40:19 qdiskd Node 1 missed an update (5/4)
>> Nov 25 23:40:19 qdiskd Node 1 DOWN
>> Nov 25 23:40:19 qdiskd Writing eviction notice for node 1
>> Nov 25 23:40:19 qdiskd Telling CMAN to kill the node
>> Nov 25 23:40:20 qdiskd Node 1 evicted
>> Nov 25 23:44:19 qdiskd Node 1 is UP
>> Nov 25 23:44:20 qdiskd Node 1 shutdown
>> Nov 25 23:44:26 qdiskd Node 1 is UP
>> Nov 25 23:44:37 qdiskd Node 1 missed an update (2/4)
>> Nov 25 23:44:38 qdiskd Node 1 missed an update (3/4)
>> Nov 25 23:44:39 qdiskd Node 1 missed an update (4/4)
>> Nov 25 23:44:40 qdiskd Node 1 missed an update (5/4)
>> Nov 25 23:44:40 qdiskd Node 1 DOWN
>> Nov 25 23:44:40 qdiskd Writing eviction notice for node 1
>> Nov 25 23:44:40 qdiskd Telling CMAN to kill the node
>> Nov 25 23:44:41 qdiskd Node 1 evicted
>> Nov 25 23:50:48 qdiskd Loading dynamic configuration
>> Nov 25 23:50:49 qdiskd Setting autocalculated votes to 1
>> Nov 25 23:50:49 qdiskd Loading static configuration
>> Nov 25 23:50:49 qdiskd Auto-configured TKO as 4 based on token=10000 interval=1
>> Nov 25 23:50:49 qdiskd Timings: 4 tko, 1 interval
>> Nov 25 23:50:49 qdiskd Timings: 2 tko_up, 3 master_wait, 2 upgrade_wait
>> Nov 25 23:50:49 qdiskd Heuristic: 'ping -c3 -w5 10.101.210.250' score=1 interval=3 tko=5
>> Nov 25 23:50:49 qdiskd 1 heuristics loaded
>> Nov 25 23:50:49 qdiskd Quorum Daemon: 1 heuristics, 1 interval, 4 tko, 1 votes
>> Nov 25 23:50:49 qdiskd Run Flags: 00000231
>> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
>> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
>> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
>> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
>> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
>> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
>> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
>> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
>> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
>> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
>> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
>>
>
> From that limited information I would guess that your quorum disk
> partition is either offline or corrupted. First check that the drive is
> online and if it seems OK physically then check that it's not been
> formatted as a filesystem or something else by mistake and rebuild the
> header using mkqdisk.
>
> Chrissie
>
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
> This email message may contain confidential and privileged information for its intended recipient(s) only. If you are not the intended recipient or the addressee, you may have received it by unauthorised means. You are to notify the sender immediately and thereafter delete the email. You are to take note that any disclosure, distribution, use or storage of this communication is strictly prohibited. Any opinions, conclusions and other information in this message that are unrelated to official business of the RHB Banking Group are those of the individual sender and shall be understood as neither explicitly given nor endorsed by the RHB Banking Group. RHB Banking Group shall not be liable for any loss or damage caused by viruses transmitted by this e-mail or its attachments. Further, the RHB Banking Group is also not responsible for any unauthorised changes made to the information or the effect thereto
>

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss