Re: Corosync consume 100% cpu with high Recv-Q and hung

On 21/04/15 12:37, Hui Xiang wrote:
> Thanks Christine.
> 
> One more question, in the broken environment, we found part of the
> source code in libqb as below:
> 1)
> void *
> qb_rb_chunk_alloc(struct qb_ringbuffer_s * rb, size_t len)
> {
>         uint32_t write_pt;
> 
>         if (rb == NULL) {
>                 errno = EINVAL;
>                 return NULL;
>         }
>         /*
>          * Reclaim data if we are over writing and we need space
>          */
>         if (rb->flags & QB_RB_FLAG_OVERWRITE) {
>                 while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
>                         _rb_chunk_reclaim(rb);
>                 }
>         } else {
>                 if (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
>                         errno = EAGAIN;
>                         return NULL;
>                 }
>         }
> 
> but in the master branch:
> 2)
>                 while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
>                         int rc = _rb_chunk_reclaim(rb);
>                         if (rc != 0) {
>                                 errno = rc;
>                                 return NULL;
>                         }
>                 }
> 
> 
> is it possible that with the code in 1) we have been stuck in the
> infinite loop
> while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {...}
> because, on the condition 'chunk_magic != QB_RB_CHUNK_MAGIC', the
> function _rb_chunk_reclaim() just returns:
> static void
> _rb_chunk_reclaim(struct qb_ringbuffer_s * rb)
> {
>         uint32_t old_read_pt;
>         uint32_t new_read_pt;
>         uint32_t old_chunk_size;
>         uint32_t chunk_magic;
> 
>         old_read_pt = rb->shared_hdr->read_pt;
>         chunk_magic = QB_RB_CHUNK_MAGIC_GET(rb, old_read_pt);
>         if (chunk_magic != QB_RB_CHUNK_MAGIC) {
>                 return;
>         }
> and there is a commit that seems to fix it [1]. Do you know the
> background of this commit? Does it look like it fixes this issue?
> 
> Thanks again :)


I don't know enough about the background to that fix. What you're saying
sounds plausible but I can't be sure. There are quite a few stability
fixes in libqb 0.17 so it could be that one or one of the others!

Chrissie


> [1]
> https://github.com/ClusterLabs/libqb/commit/a8852fc481e3aa3fce53bb9e3db79d3e7cbed0c1
> 
> 
> 
> On Tue, Apr 21, 2015 at 5:55 PM, Christine Caulfield
> <ccaulfie@xxxxxxxxxx> wrote:
> 
>     Hiya,
> 
>     It's hard to be sure without more information, sadly - if the backtrace
>     looks similar to the one you mention then upgrading libqb to 0.17 should
>     help.
> 
>     Chrissie
> 
>     On 21/04/15 07:12, Hui Xiang wrote:
>     > Thanks Christine, sorry for responding late.
>     >
>     > I got this problem again, and corosync-blackbox just hangs there
>     > with no output. Here is some other debug information for you guys.
>     >
>     > The backtrace and perf.data are very similar to the ones in link [1],
>     > but we don't know the root cause. Sure, restarting corosync is one
>     > solution, but after a while it breaks again, so we'd like to find out
>     > what's really going on there.
>     >
>     > Thanks for your efforts, very appreciated : )
>     >
>     > [1] http://www.spinics.net/lists/corosync/msg03445.html
>     >
>     >
>     > On Mon, Feb 9, 2015 at 4:38 PM, Christine Caulfield
>     > <ccaulfie@xxxxxxxxxx> wrote:
>     >
>     >     On 09/02/15 01:59, Hui Xiang wrote:
>     >     > Hi guys,
>     >     >
>     >     >   I am having an issue where corosync consumes 100% CPU and
>     >     > hangs on the command corosync-quorumtool -l; meanwhile the
>     >     > Recv-Q is very high inside the lxc container.
>     >     >  corosync version : 2.3.3
>     >     >
>     >     >  transport : unicast
>     >     >
>     >     >  After setting up 3 keystone nodes with corosync/pacemaker,
>     >     > split brain happened; on one of the keystone nodes we found
>     >     > the CPU is 100% used by corosync.
>     >     >
>     >
>     >
>     >     It looks like it might be a problem I saw while doing some
>     >     development on corosync: if it gets a SEGV, there's a signal
>     >     handler that catches it and relays it back to libqb via a pipe,
>     >     causing another SEGV, and corosync then just spins on the pipe
>     >     forever. The cause I saw is not likely to be the same as yours
>     >     (it was my coding at the time ;-) but it does sound like a
>     >     similar effect. The only way round it is to kill corosync and
>     >     restart it. There might be something in the corosync-blackbox
>     >     to indicate what went wrong, if that has been saved. If you
>     >     have that then please post it here so we can have a look.
>     >
>     >     man corosync-blackbox
>     >
>     >     Chrissie
>     >
>     >     >
>     >     > Tasks: 42 total, 2 running, 40 sleeping, 0 stopped, 0 zombie
>     >     > %Cpu(s): 100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>     >     > KiB Mem: 1017896 total, 932296 used, 85600 free, 19148 buffers
>     >     > KiB Swap: 1770492 total, 5572 used, 1764920 free, 409312 cached Mem
>     >     >
>     >     >   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>     >     > 18637 root 20 0 704252 199272 34016 R 99.9 19.6 44:40.43 corosync
>     >     >
>     >     > From netstat output, one interesting finding is that the Recv-Q
>     >     > size has a value of 320256, which is higher than normal. And
>     >     > after simply doing pkill -9 corosync and restarting
>     >     > corosync/pacemaker, the whole cluster is back to normal.
>     >     >
>     >     > Active Internet connections (only servers)
>     >     > Proto Recv-Q Send-Q Local Address Foreign Address State
>     >     PID/Program name
>     >     > udp 320256 0 192.168.100.67:5434 0.0.0.0:* 18637/corosync
>     >     >
>     >     > Udp:
>     >     >     539832 packets received
>     >     >     619 packets to unknown port received.
>     >     >     407249 packet receive errors
>     >     >     1007262 packets sent
>     >     >     RcvbufErrors: 69940
>     >     >
>     >     >
>     >     >   So I am asking: is there any known bug/issue in corosync
>     >     > that may cause it to receive packets slowly from the socket
>     >     > and hang for some reason?
>     >     >
>     >     >   Thanks a lot, looking forward to your response.
>     >     >
>     >     >
>     >     > Best Regards.
>     >     >
>     >     > Hui.
>     >     >
>     >     >
>     >     >
>     >
>     >
> 
> 

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss



