Hi,

Thanks for replying. Are there any logs I can provide to give us more leads on the cause?

Below is an excerpt from messages. Node 2 was evicted and rebooted at 22:37. These random reboots happen almost every day.

Nov 26 22:37:01 HITGSMQ01 rgmanager[31311]: [script] Executing /etc/init.d/SGQM04 status
Nov 26 22:37:11 HITGSMQ01 rgmanager[31713]: [script] Executing /etc/init.d/SGQM03 status
Nov 26 22:37:31 HITGSMQ01 rgmanager[32270]: [script] Executing /etc/init.d/SGQM04 status
Nov 26 22:37:47 HITGSMQ01 qdiskd[2357]: Writing eviction notice for node 2
Nov 26 22:37:48 HITGSMQ01 qdiskd[2357]: Node 2 evicted
Nov 26 22:37:51 HITGSMQ01 rgmanager[433]: [script] Executing /etc/init.d/SGQM03 status
Nov 26 22:37:52 HITGSMQ01 corosync[2303]: [TOTEM ] A processor failed, forming new configuration.
Nov 26 22:37:54 HITGSMQ01 corosync[2303]: [QUORUM] Members[1]: 1
Nov 26 22:37:54 HITGSMQ01 corosync[2303]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Nov 26 22:37:54 HITGSMQ01 kernel: dlm: closing connection to node 2
Nov 26 22:37:54 HITGSMQ01 rgmanager[3146]: State change: HITGSMQ02-hb DOWN
Nov 26 22:37:54 HITGSMQ01 corosync[2303]: [CPG ] chosen downlist: sender r(0) ip(10.1.3.3) ; members(old:2 left:1)
Nov 26 22:37:54 HITGSMQ01 corosync[2303]: [MAIN ] Completed service synchronization, ready to provide service.
Nov 26 22:39:57 HITGSMQ01 kernel: INFO: task rgmanager:469 blocked for more than 120 seconds.
Nov 26 22:39:57 HITGSMQ01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 26 22:39:57 HITGSMQ01 kernel: rgmanager D 0000000000000000 0 469 3144 0x00000080
Nov 26 22:39:57 HITGSMQ01 kernel: ffff880c24111c60 0000000000000086 ffff880c24111c28 ffff880c24111c24
Nov 26 22:39:57 HITGSMQ01 kernel: ffffffff81055f76 ffff880c7fc23080 ffff880028316700 000000000000047e
Nov 26 22:39:57 HITGSMQ01 kernel: ffff880c692ad058 ffff880c24111fd8 000000000000fb88 ffff880c692ad058
Nov 26 22:39:57 HITGSMQ01 kernel: Call Trace:
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81055f76>] ? enqueue_task+0x66/0x80
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8150fc45>] rwsem_down_failed_common+0x95/0x1d0
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8150fdd6>] rwsem_down_read_failed+0x26/0x30
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff812833b4>] call_rwsem_down_read_failed+0x14/0x30
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8150f2d4>] ? down_read+0x24/0x30
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffffa038e257>] dlm_user_request+0x47/0x1b0 [dlm]
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8106659b>] ? dequeue_task_fair+0x12b/0x130
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81167c53>] ? kmem_cache_alloc_trace+0x1a3/0x1b0
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffffa039b2a7>] device_write+0x5c7/0x720 [dlm]
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8109c424>] ? switch_task_namespaces+0x24/0x60
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8121baf6>] ? security_file_permission+0x16/0x20
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81180f98>] vfs_write+0xb8/0x1a0
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff81181891>] sys_write+0x51/0x90
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffffa03b09df>] twnotify_sys_write+0x1f/0x80 [twnotify]
Nov 26 22:39:57 HITGSMQ01 kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Nov 26 22:40:12 HITGSMQ01 corosync[2303]: [TOTEM ] A processor joined or left the membership and a new membership was for

-----Original Message-----
From: discuss-bounces@xxxxxxxxxxxx [mailto:discuss-bounces@xxxxxxxxxxxx] On Behalf Of Christine Caulfield
Sent: Thursday, November 27, 2014 4:40 PM
To: discuss@xxxxxxxxxxxx
Subject: Re: Desperate for Help - Cluster Node randomly reboots

On 26/11/14 09:24, Tan Ban Wee wrote:
> Hi,
>
> This is a 2 nodes cluster and they are randomly rebooting itself. I hope
> someone can help me to narrow down to the cause.
>
> Nov 25 23:40:17 qdiskd Node 1 missed an update (3/4)
> Nov 25 23:40:18 qdiskd Node 1 missed an update (4/4)
> Nov 25 23:40:19 qdiskd Node 1 missed an update (5/4)
> Nov 25 23:40:19 qdiskd Node 1 DOWN
> Nov 25 23:40:19 qdiskd Writing eviction notice for node 1
> Nov 25 23:40:19 qdiskd Telling CMAN to kill the node
> Nov 25 23:40:20 qdiskd Node 1 evicted
> Nov 25 23:44:19 qdiskd Node 1 is UP
> Nov 25 23:44:20 qdiskd Node 1 shutdown
> Nov 25 23:44:26 qdiskd Node 1 is UP
> Nov 25 23:44:37 qdiskd Node 1 missed an update (2/4)
> Nov 25 23:44:38 qdiskd Node 1 missed an update (3/4)
> Nov 25 23:44:39 qdiskd Node 1 missed an update (4/4)
> Nov 25 23:44:40 qdiskd Node 1 missed an update (5/4)
> Nov 25 23:44:40 qdiskd Node 1 DOWN
> Nov 25 23:44:40 qdiskd Writing eviction notice for node 1
> Nov 25 23:44:40 qdiskd Telling CMAN to kill the node
> Nov 25 23:44:41 qdiskd Node 1 evicted
> Nov 25 23:50:48 qdiskd Loading dynamic configuration
> Nov 25 23:50:49 qdiskd Setting autocalculated votes to 1
> Nov 25 23:50:49 qdiskd Loading static configuration
> Nov 25 23:50:49 qdiskd Auto-configured TKO as 4 based on token=10000 interval=1
> Nov 25 23:50:49 qdiskd Timings: 4 tko, 1 interval
> Nov 25 23:50:49 qdiskd Timings: 2 tko_up, 3 master_wait, 2 upgrade_wait
> Nov 25 23:50:49 qdiskd Heuristic: 'ping -c3 -w5 10.101.210.250' score=1 interval=3 tko=5
> Nov 25 23:50:49 qdiskd 1 heuristics loaded
> Nov 25 23:50:49 qdiskd Quorum Daemon: 1 heuristics, 1 interval, 4 tko, 1 votes
> Nov 25 23:50:49 qdiskd Run Flags: 00000231
> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
> Nov 25 23:50:49 qdiskd diskRawReadShadow: bad CRC32, offset = 0 len = 512
> Nov 25 23:50:49 qdiskd Header CRC32 mismatch; Exp: 0x00000000 Got: 0x190a55ad
>

From that limited information I would guess that your quorum disk
partition is either offline or corrupted. First check that the drive is
online and if it seems OK physically then check that it's not been
formatted as a filesystem or something else by mistake and rebuild the
header using mkqdisk.

Chrissie

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

This email message may contain confidential and privileged information for its intended recipient(s) only.
If you are not the intended recipient or the addressee, you may have received it by unauthorised means. You are to notify the sender immediately and thereafter delete the email. You are to take note that any disclosure, distribution, use or storage of this communication is strictly prohibited. Any opinions, conclusions and other information in this message that are unrelated to official business of the RHB Banking Group are those of the individual sender and shall be understood as neither explicitly given nor endorsed by the RHB Banking Group. RHB Banking Group shall not be liable for any loss or damage caused by viruses transmitted by this e-mail or its attachments. Further, the RHB Banking Group is also not responsible for any unauthorised changes made to the information or the effect thereto.