transport : unicast
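For reference, the totem section of our corosync.conf looks roughly like this (a sketch; the bind network and port are illustrative, chosen to match the 192.168.100.67:5434 socket shown in the netstat output below):

```
totem {
    version: 2
    transport: udpu          # unicast UDP
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.100.0
        mcastport: 5434      # UDP port corosync binds, even with udpu
    }
}
```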
After setting up 3 keystone nodes with corosync/pacemaker, a split brain occurred; on one of the keystone nodes we found that corosync was using 100% CPU.
**
Tasks: 42 total, 2 running, 40 sleeping, 0 stopped, 0 zombie
%Cpu(s):100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 1017896 total, 932296 used, 85600 free, 19148 buffers
KiB Swap: 1770492 total, 5572 used, 1764920 free. 409312 cached Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18637 root 20 0 704252 199272 34016 R 99.9 19.6 44:40.43 corosync
From the netstat output, one interesting finding is that the Recv-Q size is 320256, which is much higher than normal.
And after simply doing pkill -9 corosync and restarting corosync/pacemaker, the whole cluster came back to normal.
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
udp 320256 0 192.168.100.67:5434 0.0.0.0:* 18637/corosync
Udp:
539832 packets received
619 packets to unknown port received.
407249 packet receive errors
1007262 packets sent
RcvbufErrors: 69940
**
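In case it is useful to anyone watching for the same condition: here is a small diagnostic sketch (my own, not part of corosync or its tooling) that reads the per-socket Recv-Q byte count straight from /proc/net/udp, which is where netstat gets the number above. The little-endian decoding of the address field assumes an x86 host.

```python
import socket
import struct

def udp_rx_queues(proc_net_udp_text):
    """Parse /proc/net/udp contents; return {(ip, port): rx_queue_bytes}.

    rx_queue is the kernel's receive-queue byte count, i.e. the value
    netstat shows in the Recv-Q column for UDP sockets.
    """
    queues = {}
    for line in proc_net_udp_text.splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) < 5:
            continue
        ip_hex, port_hex = fields[1].split(':')      # local_address column
        # /proc prints the IPv4 address as a host-order u32 in hex;
        # on x86 (little-endian) it must be unpacked accordingly.
        ip = socket.inet_ntoa(struct.pack('<I', int(ip_hex, 16)))
        port = int(port_hex, 16)
        rx_hex = fields[4].split(':')[1]             # tx_queue:rx_queue column
        queues[(ip, port)] = int(rx_hex, 16)
    return queues

if __name__ == '__main__':
    with open('/proc/net/udp') as f:
        for (ip, port), rx in sorted(udp_rx_queues(f.read()).items()):
            if rx > 0:
                print('%s:%d Recv-Q=%d' % (ip, port, rx))
```

Running this in a loop would show the corosync socket's Recv-Q climbing steadily while the daemon spins, instead of hovering near zero.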
So I am asking: is there any known bug/issue in corosync that could cause it to read packets from the socket slowly and then hang for some reason?
Thanks a lot; looking forward to your response.
Best Regards.
Hui.
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss