Re: Corosync consume 100% cpu with high Recv-Q and hung

Thanks Christine, sorry for responding late.

I got this problem again, and corosync-blackbox just hangs there with no output. Here is some more debug information for you guys.

The backtrace and perf.data are very similar to those in [1], but we don't know the root cause. Restarting corosync is a workaround, but after a while it breaks again, so we'd like to find out what is really going on.

Thanks for your efforts, much appreciated :)

[1] http://www.spinics.net/lists/corosync/msg03445.html


On Mon, Feb 9, 2015 at 4:38 PM, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
On 09/02/15 01:59, Hui Xiang wrote:
> Hi guys,
>
>   I am having an issue with corosync where it consumes 100% cpu and hung on
> the command corosync-quorumtool -l, Recv-Q is very high in the meantime
> inside lxc container.
>  corosync version : 2.3.3
>
>  transport : unicast
>
>  After setting up 3 keystone nodes with corosync/pacemaker, split brain
> happened, on one of the keystone nodes we found the cpu is 100% used by
> corosync.
>


It looks like it might be a problem I saw while doing some development
on corosync: if it gets a SEGV, there's a signal handler that catches it
and relays it back to libqb via a pipe, causing another SEGV, and
corosync then just spins on the pipe forever. The cause I saw is not
likely to be the same as yours (it was my coding at the time ;-) but it
does sound like a similar effect. The only way round it is to kill
corosync and restart it. There might be something in the
corosync-blackbox to indicate what went wrong, if that has been saved. If
you have it then please post it here so we can have a look.

man corosync-blackbox

Chrissie

>
> Tasks: 42 total, 2 running, 40 sleeping, 0 stopped, 0 zombie
> %Cpu(s):100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> KiB Mem: 1017896 total, 932296 used, 85600 free, 19148 buffers
> KiB Swap: 1770492 total, 5572 used, 1764920 free. 409312 cached Mem
>
>   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 18637 root 20 0 704252 199272 34016 R 99.9 19.6 44:40.43 corosync
>
> From the netstat output, one interesting finding is that the Recv-Q size has a value
> of 320256, which is much higher than normal.
> And after simply doing pkill -9 corosync and restarting corosync/pacemaker,
> the whole cluster is back to normal.
>
> Active Internet connections (only servers)
> Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
> udp 320256 0 192.168.100.67:5434 0.0.0.0:* 18637/corosync
>
> Udp:
>     539832 packets received
>     619 packets to unknown port received.
>     407249 packet receive errors
>     1007262 packets sent
>     RcvbufErrors: 69940
>
>
>   So I am asking: is there any known bug/issue in corosync that may cause
> it to receive packets from the socket slowly and then hang?
>
>   Thanks a lot, looking forward to your response.
>
>
> Best Regards.
>
> Hui.
>
>
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
>


