Re: Corosync consume 100% cpu with high Recv-Q and hung


 



In the master branch, qb_rb_chunk_alloc() may fail because _rb_chunk_reclaim() returns -EINVAL while the condition (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) still holds; here is the backtrace I got:

451                     while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
(gdb) p qb_rb_space_free(rb)
$5 = 408
(gdb) p len
$6 = 561

QB_RB_CHUNK_MARGIN apparently equals 12:
#define QB_RB_CHUNK_MARGIN (sizeof(uint32_t) * (QB_RB_CHUNK_HEADER_WORDS +\
                                                QB_RB_WORD_ALIGN +\
                                                QB_CACHE_LINE_WORDS))



qb_rb_chunk_alloc(struct qb_ringbuffer_s * rb, size_t len)
{
       .....
        /*
         * Reclaim data if we are over writing and we need space
         */
        if (rb->flags & QB_RB_FLAG_OVERWRITE) {
                while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
                        int rc = _rb_chunk_reclaim(rb);
                        if (rc != 0) {
                                errno = rc;
                                return NULL;
                        }
                }
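
To make that easier to see outside corosync, here is a small standalone sketch (NOT libqb code; just simplified stand-ins that reuse the values from gdb above and a hypothetical margin of 12) of how the allocation loop behaves with the old void-returning reclaim versus the fixed error-returning one:

/* Standalone sketch, NOT libqb: simplified stand-ins for the two reclaim
 * variants discussed in this thread, using the values seen in gdb above. */
#include <errno.h>
#include <stdio.h>

#define CHUNK_MARGIN 12                 /* what QB_RB_CHUNK_MARGIN appears to be here */

size_t free_space = 408;                /* qb_rb_space_free(rb) from gdb */
int chunk_magic_ok = 0;                 /* pretend the next chunk's magic is corrupted */

/* Old behaviour: silently return when the chunk magic does not match. */
void reclaim_old(void)
{
        if (!chunk_magic_ok)
                return;                 /* read_pt never advances, nothing is freed */
        free_space += 100;
}

/* Behaviour after the fix: report the failure to the caller. */
int reclaim_new(void)
{
        if (!chunk_magic_ok)
                return -EINVAL;
        free_space += 100;
        return 0;
}

int main(void)
{
        size_t len = 561;               /* the len value from gdb */

        /* With reclaim_old() this loop would spin forever at 100% CPU:
         *     while (free_space < len + CHUNK_MARGIN)
         *             reclaim_old();
         * With the fixed variant the allocation fails instead of hanging: */
        while (free_space < len + CHUNK_MARGIN) {
                int rc = reclaim_new();
                if (rc != 0) {
                        errno = -rc;
                        perror("chunk allocation would fail here");
                        return 1;
                }
        }
        return 0;
}

Either way the allocation does not succeed, which is why I'm asking about where 'len' comes from below.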

So do you know how we should control the value of 'len', and where it comes from, so that qb_rb_chunk_alloc() does not fail? I can reproduce this problem by setting the NIC of one corosync node to an MTU of 5000 while the others stay at 1500 everywhere. Is it related to big packets being thrown at the ring buffer?
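
For reference, a sketch of the corosync.conf totem fragment this would touch (assuming the standard netmtu option, whose default is 1500, and the unicast transport we use):

totem {
        version: 2
        # unicast transport, as in our current setup
        transport: udpu
        # Keep the totem MTU no larger than the smallest NIC MTU in the
        # cluster, instead of letting one node run with a 5000-byte MTU
        # while the others stay at 1500.
        netmtu: 1500
}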


Thanks Christine, I very much appreciate your reply :)


On Tue, Apr 21, 2015 at 8:33 PM, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
On 21/04/15 12:37, Hui Xiang wrote:
> Thanks Christine.
>
> One more question: in the broken environment, part of the libqb source
> code looks like this:
> 1)
> void *
> qb_rb_chunk_alloc(struct qb_ringbuffer_s * rb, size_t len)
> {
>         uint32_t write_pt;
>
>         if (rb == NULL) {
>                 errno = EINVAL;
>                 return NULL;
>         }
>         /*
>          * Reclaim data if we are over writing and we need space
>          */
>         if (rb->flags & QB_RB_FLAG_OVERWRITE) {
>                 while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
>                         _rb_chunk_reclaim(rb);
>                 }
>         } else {
>                 if (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
>                         errno = EAGAIN;
>                         return NULL;
>                 }
>         }
>
> but in the master branch:
> 2)
>                 while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
>                         int rc = _rb_chunk_reclaim(rb);
>                         if (rc != 0) {
>                                 errno = rc;
>                                 return NULL;
>                         }
>                 }
>
>
> is it possible that with code 1) we have been stuck in the infinite
> loop of
> while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {...}, because
> when 'chunk_magic != QB_RB_CHUNK_MAGIC' the function
> _rb_chunk_reclaim() just returns:
> static void
> _rb_chunk_reclaim(struct qb_ringbuffer_s * rb)
> {
>         uint32_t old_read_pt;
>         uint32_t new_read_pt;
>         uint32_t old_chunk_size;
>         uint32_t chunk_magic;
>
>         old_read_pt = rb->shared_hdr->read_pt;
>         chunk_magic = QB_RB_CHUNK_MAGIC_GET(rb, old_read_pt);
>         if (chunk_magic != QB_RB_CHUNK_MAGIC) {
>                 return;
>         }
>
> and there is a commit that seems to fix it [1]; do you know the
> background of this commit? Does it look like it fixes this?
>
> Thanks again :)


I don't know enough about the background of that fix. What you're saying
sounds plausible but I can't be sure. There are quite a few stability
fixes in libqb 0.17 so it could be that or one of the others!

Chrissie


> [1]
> https://github.com/ClusterLabs/libqb/commit/a8852fc481e3aa3fce53bb9e3db79d3e7cbed0c1
>
>
>
> On Tue, Apr 21, 2015 at 5:55 PM, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
>
>     Hiya,
>
>     It's hard to be sure without more information, sadly - if the backtrace
>     looks similar to the one you mention then upgrading libqb to 0.17 should
>     help.
>
>     Chrissie
>
>     On 21/04/15 07:12, Hui Xiang wrote:
>     > Thanks Christine, sorry for responding late.
>     >
>     > I got this problem again, and corosync-blackbox just hangs there, no
>     > output. Here is some other debug information for you guys.
>     >
>     > The backtrace and perf.data are very similar to those in link [1], but we
>     > don't know what the root cause is. Sure, restarting corosync is one
>     > solution, but after a while it breaks again, so we'd like to find out
>     > what's really going on there.
>     >
>     > Thanks for your efforts, much appreciated :)
>     >
>     > [1] http://www.spinics.net/lists/corosync/msg03445.html
>     >
>     >
>     > On Mon, Feb 9, 2015 at 4:38 PM, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
>     >
>     >     On 09/02/15 01:59, Hui Xiang wrote:
>     >     > Hi guys,
>     >     >
>     >     >   I am having an issue where corosync consumes 100% CPU and hangs on
>     >     > the command corosync-quorumtool -l; meanwhile the Recv-Q is very high
>     >     > inside the lxc container.
>     >     >  corosync version : 2.3.3
>     >     >
>     >     >  transport : unicast
>     >     >
>     >     >  After setting up 3 keystone nodes with corosync/pacemaker, a split
>     >     > brain happened; on one of the keystone nodes we found the CPU is 100%
>     >     > used by corosync.
>     >     >
>     >
>     >
>     >     It looks like it might be a problem I saw while doing some development
>     >     on corosync: if it gets a SEGV, there's a signal handler that catches it
>     >     and relays it back to libqb via a pipe, causing another SEGV, and
>     >     corosync is then just spinning on the pipe for ever. The cause I saw is
>     >     not likely to be the same as yours (it was my coding at the time ;-) but
>     >     it does sound like a similar effect. The only way round it is to kill
>     >     corosync and restart it. There might be something in the
>     >     corosync-blackbox to indicate what went wrong if that has been saved. If
>     >     you have that then please post it here so we can have a look.
>     >
>     >     man corosync-blackbox
>     >
>     >     Chrissie
>     >
>     >     >
>     >     > Tasks: 42 total, 2 running, 40 sleeping, 0 stopped, 0 zombie
>     >     > %Cpu(s): 100.0 us, 0.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>     >     > KiB Mem: 1017896 total, 932296 used, 85600 free, 19148 buffers
>     >     > KiB Swap: 1770492 total, 5572 used, 1764920 free. 409312 cached Mem
>     >     >
>     >     >   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>     >     > 18637 root 20 0 704252 199272 34016 R 99.9 19.6 44:40.43 corosync
>     >     >
>     >     > From the netstat output, one interesting finding is that the Recv-Q
>     >     > size has a value of 320256, which is higher than normal.
>     >     > And after simply doing pkill -9 corosync and restarting
>     >     > corosync/pacemaker, the whole cluster is back to normal.
>     >     >
>     >     > Active Internet connections (only servers)
>     >     > Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
>     >     > udp 320256 0 192.168.100.67:5434 0.0.0.0:* 18637/corosync
>     >     >
>     >     > Udp:
>     >     >     539832 packets received
>     >     >     619 packets to unknown port received.
>     >     >     407249 packet receive errors
>     >     >     1007262 packets sent
>     >     >     RcvbufErrors: 69940
>     >     >
>     >     >
>     >     >   So I am asking whether there is any bug/issue in corosync that may
>     >     > cause it to receive packets from the socket slowly and hang for some
>     >     > reason?
>     >     >
>     >     >   Thanks a lot, looking forward to your response.
>     >     >
>     >     >
>     >     > Best Regards.
>     >     >
>     >     > Hui.
>     >     >
>     >     >
>     >     >
>     >
>     >
>
>


_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

