Re: Corosync consume 100% cpu with high Recv-Q and hung


 



On 09/02/15 01:59, Hui Xiang wrote:
> Hi guys,
> 
>   I am having an issue with corosync inside an LXC container: it consumes
> 100% CPU and hangs on the command corosync-quorumtool -l, and Recv-Q is
> very high at the same time.
>  corosync version : 2.3.3
> 
>  transport : unicast
> 
>  After setting up 3 keystone nodes with corosync/pacemaker, a split brain
> happened, and on one of the keystone nodes we found that corosync was
> using 100% of the CPU.
> 


It looks like it might be a problem I saw while doing some development
on corosync. If corosync gets a SEGV, there's a signal handler that
catches it and relays it back to libqb via a pipe. That causes another
SEGV, and corosync then just spins on the pipe forever. The cause I saw
is not likely to be the same as yours (it was my own coding at the time
;-) but it does sound like a similar effect. The only way round it is to
kill corosync and restart it. There might be something in the
corosync-blackbox to indicate what went wrong, if that has been saved.
If you have it, please post it here so we can have a look.

man corosync-blackbox
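
For anyone unfamiliar with the "relay a signal through a pipe" idea mentioned above, here is a minimal Python sketch of the general self-pipe pattern. It is not corosync or libqb code: the names relay_signal and drain_one are made up for illustration, and SIGUSR1 stands in for SEGV so the example is safe to run. It only shows how a handler hands a signal number to a main loop via a pipe. If handling that relayed signal faulted again, the loop would keep being woken by the pipe, which is roughly the spinning effect described.

```python
import os
import signal

# Hypothetical illustration only; not taken from corosync or libqb.
read_fd, write_fd = os.pipe()

def relay_signal(signum, frame):
    """Signal handler: do the minimum and push the signal number into the pipe."""
    os.write(write_fd, bytes([signum]))

def drain_one():
    """Main-loop side: read one relayed signal number back out of the pipe."""
    return os.read(read_fd, 1)[0]

# SIGUSR1 is used as a harmless stand-in for the SEGV case in the thread.
signal.signal(signal.SIGUSR1, relay_signal)
signal.raise_signal(signal.SIGUSR1)

received = drain_one()
print(signal.Signals(received).name)
```

The handler stays tiny on purpose: the real work is meant to happen later in the main loop, once it sees the pipe become readable.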

Chrissie

> 
> Tasks: 42 total, 2 running, 40 sleeping, 0 stopped, 0 zombie
> %Cpu(s):100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> KiB Mem: 1017896 total, 932296 used, 85600 free, 19148 buffers
> KiB Swap: 1770492 total, 5572 used, 1764920 free. 409312 cached Mem
> 
>   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 18637 root 20 0 704252 199272 34016 R 99.9 19.6 44:40.43 corosync
> 
> From the netstat output, one interesting finding is that Recv-Q has a
> value of 320256, which is higher than normal.
> After simply doing pkill -9 corosync and restarting corosync/pacemaker,
> the whole cluster is back to normal.
> 
> Active Internet connections (only servers)
> Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
> udp 320256 0 192.168.100.67:5434 0.0.0.0:* 18637/corosync
> 
> Udp:
>     539832 packets received
>     619 packets to unknown port received.
>     407249 packet receive errors
>     1007262 packets sent
>     RcvbufErrors: 69940
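
The Recv-Q column that netstat prints for UDP sockets comes from the rx_queue field in /proc/net/udp, so the backlog can also be watched without netstat. The following is a rough Python sketch, assuming a little-endian Linux host and IPv4 sockets only. The function name parse_proc_net_udp is hypothetical, and SAMPLE is a synthetic line built from the address, port and queue size quoted above; its other columns are placeholders, not real output from the affected node.

```python
import socket
import struct

def parse_proc_net_udp(text):
    """Return (local_ip, local_port, rx_queue_bytes) for each socket row."""
    rows = []
    for line in text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 5:
            continue
        ip_hex, port_hex = fields[1].split(":")
        # Assumption: little-endian host, so the hex word is byte-swapped.
        ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
        _tx_hex, rx_hex = fields[4].split(":")  # field is tx_queue:rx_queue
        rows.append((ip, int(port_hex, 16), int(rx_hex, 16)))
    return rows

# Synthetic sample: only the address, port and rx_queue mirror the thread.
SAMPLE = (
    "  sl  local_address rem_address   st tx_queue rx_queue tr tm->when "
    "retrnsmt   uid  timeout inode ref pointer drops\n"
    "   0: 4364A8C0:153A 00000000:0000 07 00000000:0004E300 00:00000000 "
    "00000000     0        0 0 2 0000000000000000 0\n"
)

for ip, port, rx_queue in parse_proc_net_udp(SAMPLE):
    print(f"{ip}:{port} rx_queue={rx_queue}")
```

On a live node the same function could be fed the contents of /proc/net/udp. A value that stays large means the process is not draining its socket.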
> 
> 
>   So I am asking whether there is any known bug or issue in corosync that
> may cause it to receive packets slowly from the socket and then hang?
> 
>   Thanks a lot, looking forward to your response.
> 
> 
> Best Regards.
> 
> Hui.
> 
> 
> 
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
> 




