Thanks Christine.
One more question: in the broken environment, we found the following code in libqb:
1)
void *
qb_rb_chunk_alloc(struct qb_ringbuffer_s * rb, size_t len)
{
    uint32_t write_pt;

    if (rb == NULL) {
        errno = EINVAL;
        return NULL;
    }
    /*
     * Reclaim data if we are over writing and we need space
     */
    if (rb->flags & QB_RB_FLAG_OVERWRITE) {
        while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
            _rb_chunk_reclaim(rb);
        }
    } else {
        if (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
            errno = EAGAIN;
            return NULL;
        }
    }
    ...
but in the master branch:
2)
    while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
        int rc = _rb_chunk_reclaim(rb);
        if (rc != 0) {
            errno = rc;
            return NULL;
        }
    }
is it possible that, with code 1), we have been stuck in the infinite loop of
while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {...}? When
'chunk_magic != QB_RB_CHUNK_MAGIC', _rb_chunk_reclaim() just returns:
static void
_rb_chunk_reclaim(struct qb_ringbuffer_s * rb)
{
    uint32_t old_read_pt;
    uint32_t new_read_pt;
    uint32_t old_chunk_size;
    uint32_t chunk_magic;

    old_read_pt = rb->shared_hdr->read_pt;
    chunk_magic = QB_RB_CHUNK_MAGIC_GET(rb, old_read_pt);
    if (chunk_magic != QB_RB_CHUNK_MAGIC) {
        return;
    }
    ...
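In other words, when the chunk magic is corrupted, that early return never advances read_pt, so qb_rb_space_free(rb) stays constant and the overwrite loop in 1) can never make progress. A minimal standalone sketch of that shape (hypothetical names, not libqb code):

#include <stdio.h>

/* Hypothetical stand-ins for the libqb state quoted above. */
static unsigned int free_space = 8;   /* what qb_rb_space_free() would return */
static int magic_ok = 0;              /* chunk_magic != QB_RB_CHUNK_MAGIC */

static void reclaim(void)
{
    if (!magic_ok) {
        return;           /* early return: read_pt never advances */
    }
    free_space += 16;     /* the normal path would free a chunk */
}

int main(void)
{
    unsigned int needed = 64;
    unsigned long spins = 0;

    /* Mirrors the overwrite loop in qb_rb_chunk_alloc(): reclaim() is a
     * no-op, free_space never changes, and the loop would spin forever
     * (capped here only so the demo terminates). */
    while (free_space < needed) {
        reclaim();
        if (++spins == 100000000UL) {
            printf("no progress after %lu iterations\n", spins);
            return 1;
        }
    }
    return 0;
}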
There is also a commit that seems to fix it [1]. Do you know the background of that commit? Does it look like it fixes this issue?
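From the call site in 2) one can at least infer the shape of the fix: the reclaim presumably reports the corrupted chunk instead of silently returning, so the allocation loop can bail out. A sketch of that shape, inferred from the loop above rather than taken from the actual commit:

static int
_rb_chunk_reclaim(struct qb_ringbuffer_s * rb)
{
    uint32_t old_read_pt;
    uint32_t chunk_magic;

    old_read_pt = rb->shared_hdr->read_pt;
    chunk_magic = QB_RB_CHUNK_MAGIC_GET(rb, old_read_pt);
    if (chunk_magic != QB_RB_CHUNK_MAGIC) {
        /* Report the corruption so the caller's loop can stop. */
        return EINVAL;
    }
    /* ... advance read_pt and free the chunk as before ... */
    return 0;
}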
Thanks again :)
On Tue, Apr 21, 2015 at 5:55 PM, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
Hiya,
It's hard to be sure without more information, sadly - if the backtrace
looks similar to the one you mention then upgrading libqb to 0.17 should
help.
Chrissie
On 21/04/15 07:12, Hui Xiang wrote:
> Thanks Christine, sorry for responding late.
>
> I got this problem again, and corosync-blackbox just hangs there with no
> output. Here is some other debug information for you guys.
>
> The backtrace and perf.data are very similar to those in link [1], but we
> don't know the root cause. Restarting corosync is one solution, but after
> a while it breaks again, so we'd like to find out what's really going on.
>
> Thanks for your efforts, much appreciated :)
>
> [1] http://www.spinics.net/lists/corosync/msg03445.html
>
>
> On Mon, Feb 9, 2015 at 4:38 PM, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
>
> On 09/02/15 01:59, Hui Xiang wrote:
> > Hi guys,
> >
> > I am having an issue where corosync consumes 100% CPU and hangs on the
> > command corosync-quorumtool -l; Recv-Q is very high in the meantime
> > inside the lxc container.
> > corosync version : 2.3.3
> >
> > transport : unicast
> >
> > After setting up 3 keystone nodes with corosync/pacemaker, a split brain
> > happened; on one of the keystone nodes we found the CPU 100% consumed by
> > corosync.
> >
>
>
> It looks like it might be a problem I saw while doing some development
> on corosync: if it gets a SEGV, there's a signal handler that catches it
> and relays it back to libqb via a pipe, causing another SEGV, and
> corosync is then just spinning on the pipe for ever. The cause I saw is
> not likely to be the same as yours (it was my coding at the time ;-) but
> it does sound like a similar effect. The only way round it is to kill
> corosync and restart it. There might be something in the
> corosync-blackbox to indicate what went wrong, if that has been saved. If
> you have that then please post it here so we can have a look.
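>
> (Illustration of the pattern only, not corosync's actual handler code:
> roughly, a SIGSEGV handler that relays the fault over a pipe; if the
> relay path itself faults, the handler re-enters and the process spins
> instead of crashing cleanly.)
>
> #include <signal.h>
> #include <string.h>
> #include <unistd.h>
>
> static int relay_fd;  /* write end of a pipe, set up at startup */
>
> static void segv_handler(int sig)
> {
>     char c = (char)sig;
>     /* If the state behind this write is itself corrupted, the write
>      * can fault again and re-enter this handler, leaving the process
>      * spinning rather than dumping core. */
>     (void)write(relay_fd, &c, 1);
> }
>
> static int setup_handler(int pipe_write_end)
> {
>     struct sigaction sa;
>
>     memset(&sa, 0, sizeof(sa));
>     relay_fd = pipe_write_end;
>     sa.sa_handler = segv_handler;
>     return sigaction(SIGSEGV, &sa, NULL);
> }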
>
> man corosync-blackbox
>
> Chrissie
>
> >
> > Tasks:  42 total,   2 running,  40 sleeping,   0 stopped,   0 zombie
> > %Cpu(s): 100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> > KiB Mem:   1017896 total,   932296 used,    85600 free,    19148 buffers
> > KiB Swap:  1770492 total,     5572 used,  1764920 free.   409312 cached Mem
> >
> >   PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
> > 18637 root     20   0  704252 199272 34016 R  99.9 19.6 44:40.43 corosync
> >
> > From the netstat output, one interesting finding is that the Recv-Q
> > size has a value of 320256, which is higher than normal.
> > And after simply doing pkill -9 corosync and restarting
> > corosync/pacemaker, the whole cluster came back to normal.
> >
> > Active Internet connections (only servers)
> > Proto Recv-Q Send-Q Local Address         Foreign Address   State   PID/Program name
> > udp   320256      0 192.168.100.67:5434   0.0.0.0:*                 18637/corosync
> >
> > Udp:
> > 539832 packets received
> > 619 packets to unknown port received.
> > 407249 packet receive errors
> > 1007262 packets sent
> > RcvbufErrors: 69940
> >
> >
> > So I am asking: is there any known bug/issue with corosync that may
> > cause it to receive packets slowly from the socket and hang for some
> > reason?
> >
> > Thanks a lot, looking forward to your response.
> >
> >
> > Best Regards.
> >
> > Hui.
> >
> >
> >
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss