On 21/04/15 12:37, Hui Xiang wrote:
> Thanks Christine.
>
> One more question: in the broken environment we found this code in our
> copy of libqb:
>
> 1)
>
> void *
> qb_rb_chunk_alloc(struct qb_ringbuffer_s * rb, size_t len)
> {
> 	uint32_t write_pt;
>
> 	if (rb == NULL) {
> 		errno = EINVAL;
> 		return NULL;
> 	}
> 	/*
> 	 * Reclaim data if we are over writing and we need space
> 	 */
> 	if (rb->flags & QB_RB_FLAG_OVERWRITE) {
> 		while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
> 			_rb_chunk_reclaim(rb);
> 		}
> 	} else {
> 		if (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
> 			errno = EAGAIN;
> 			return NULL;
> 		}
> 	}
>
> but in the master branch it is:
>
> 2)
>
> 	while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {
> 		int rc = _rb_chunk_reclaim(rb);
> 		if (rc != 0) {
> 			errno = rc;
> 			return NULL;
> 		}
> 	}
>
> Is it possible that with code 1) we have been stuck in the infinite loop
> while (qb_rb_space_free(rb) < (len + QB_RB_CHUNK_MARGIN)) {...}
> when chunk_magic != QB_RB_CHUNK_MAGIC, because in that case
> _rb_chunk_reclaim() just returns without reclaiming anything:
>
> static void
> _rb_chunk_reclaim(struct qb_ringbuffer_s * rb)
> {
> 	uint32_t old_read_pt;
> 	uint32_t new_read_pt;
> 	uint32_t old_chunk_size;
> 	uint32_t chunk_magic;
>
> 	old_read_pt = rb->shared_hdr->read_pt;
> 	chunk_magic = QB_RB_CHUNK_MAGIC_GET(rb, old_read_pt);
> 	if (chunk_magic != QB_RB_CHUNK_MAGIC) {
> 		return;
> 	}
>
> There is also a commit [1] that seems to fix this. Do you know the
> background of that commit? Does it look like it would fix our problem?
>
> Thanks again :)

I don't know enough about the background to that fix. What you're saying
sounds plausible but I can't be sure. There are quite a few stability
fixes in libqb 0.17, so it could be that one or one of the others!
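For what it's worth, here is a tiny self-contained model of the control
flow you are describing. It is not the real libqb code - the ring buffer
is stubbed down to a couple of counters and the bad chunk magic is just a
flag - it only shows why a reclaim that returns void and bails out early
leaves the overwrite loop with no way to make progress, and how the
master-branch variant turns that into a failed allocation instead:

/*
 * Toy model of the two qb_rb_chunk_alloc() variants quoted above.
 * Not libqb: the ring buffer is reduced to a couple of counters and
 * "corruption" is simulated with a flag.
 */
#include <errno.h>
#include <stdio.h>

struct toy_rb {
	size_t free_bytes;    /* space currently available               */
	size_t chunk_size;    /* bytes freed per successful reclaim      */
	int    magic_corrupt; /* models chunk_magic != QB_RB_CHUNK_MAGIC */
};

/* Old behaviour: silently gives up when the chunk header looks bad. */
static void reclaim_old(struct toy_rb *rb)
{
	if (rb->magic_corrupt)
		return;               /* nothing freed, caller not told */
	rb->free_bytes += rb->chunk_size;
}

/* Fixed behaviour (as in the master-branch snippet): report the error. */
static int reclaim_new(struct toy_rb *rb)
{
	if (rb->magic_corrupt)
		return EINVAL;        /* caller can stop looping */
	rb->free_bytes += rb->chunk_size;
	return 0;
}

/* Overwrite-mode allocation loop built on the fixed reclaim. */
static int alloc_new(struct toy_rb *rb, size_t len)
{
	while (rb->free_bytes < len) {
		int rc = reclaim_new(rb);
		if (rc != 0) {
			errno = rc;
			return -1;    /* fail instead of spinning */
		}
	}
	return 0;
}

int main(void)
{
	struct toy_rb rb = { 0, 64, 1 };
	int i;

	/*
	 * Old behaviour: every pass calls a reclaim that silently bails
	 * out, so free_bytes never grows and the real loop
	 * "while (qb_rb_space_free(rb) < len + margin)" would spin for
	 * ever. Bounded to 3 passes here so the demo terminates.
	 */
	for (i = 0; i < 3 && rb.free_bytes < 128; i++) {
		reclaim_old(&rb);
		printf("old reclaim, pass %d: free_bytes = %zu (no progress)\n",
		       i + 1, rb.free_bytes);
	}

	/* Fixed behaviour: the caller is told reclaim failed and gives up. */
	if (alloc_new(&rb, 128) != 0)
		perror("alloc_new");
	return 0;
}

Compiled and run, it prints "no progress" for the bounded old-style passes
and then "alloc_new: Invalid argument" from the fixed path, which matches
the difference between the two snippets you posted.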
Chrissie

> [1]
> https://github.com/ClusterLabs/libqb/commit/a8852fc481e3aa3fce53bb9e3db79d3e7cbed0c1
>
> On Tue, Apr 21, 2015 at 5:55 PM, Christine Caulfield
> <ccaulfie@xxxxxxxxxx> wrote:
>
> > Hiya,
> >
> > It's hard to be sure without more information, sadly - if the
> > backtrace looks similar to the one you mention then upgrading libqb
> > to 0.17 should help.
> >
> > Chrissie
> >
> > On 21/04/15 07:12, Hui Xiang wrote:
> > > Thanks Christine, sorry for responding late.
> > >
> > > I got this problem again, and corosync-blackbox just hangs there
> > > with no output. Here is some other debug information for you.
> > >
> > > The backtrace and perf.data are very similar to those in link [1],
> > > but we don't know the root cause. Restarting corosync is one
> > > workaround, but after a while it breaks again, so we'd like to find
> > > out what is really going on.
> > >
> > > Thanks for your efforts, very appreciated :)
> > >
> > > [1] http://www.spinics.net/lists/corosync/msg03445.html
> > >
> > > On Mon, Feb 9, 2015 at 4:38 PM, Christine Caulfield
> > > <ccaulfie@xxxxxxxxxx> wrote:
> > >
> > > > On 09/02/15 01:59, Hui Xiang wrote:
> > > > > Hi guys,
> > > > >
> > > > > I am having an issue with corosync where it consumes 100% CPU
> > > > > and hangs on the command corosync-quorumtool -l; Recv-Q is very
> > > > > high in the meantime, inside an LXC container.
> > > > >
> > > > > corosync version: 2.3.3
> > > > > transport: unicast
> > > > >
> > > > > After setting up 3 keystone nodes with corosync/pacemaker,
> > > > > split brain happened, and on one of the keystone nodes we found
> > > > > the CPU 100% used by corosync.
> > > > >
> > > >
> > > > It looks like it might be a problem I saw while doing some
> > > > development on corosync: if it gets a SEGV, there's a signal
> > > > handler that catches it and relays it back to libqb via a pipe,
> > > > causing another SEGV, and corosync is then just spinning on the
> > > > pipe for ever. The cause I saw is not likely to be the same as
> > > > yours (it was my coding at the time ;-) but it does sound like a
> > > > similar effect. The only way round it is to kill corosync and
> > > > restart it. There might be something in the corosync-blackbox to
> > > > indicate what went wrong, if that has been saved. If you have
> > > > that then please post it here so we can have a look.
> > > >
> > > > man corosync-blackbox
> > > >
> > > > Chrissie
> > > >
> > > > > Tasks:  42 total,  2 running,  40 sleeping,  0 stopped,  0 zombie
> > > > > %Cpu(s): 100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
> > > > > KiB Mem:  1017896 total,  932296 used,  85600 free,  19148 buffers
> > > > > KiB Swap: 1770492 total,  5572 used,  1764920 free.  409312 cached Mem
> > > > >
> > > > >   PID USER  PR NI   VIRT    RES   SHR S  %CPU %MEM    TIME+ COMMAND
> > > > > 18637 root  20  0 704252 199272 34016 R  99.9 19.6 44:40.43 corosync
> > > > >
> > > > > From the netstat output, one interesting finding is that the
> > > > > Recv-Q size has a value of 320256, which is higher than normal.
> > > > > And after simply doing pkill -9 corosync and restarting
> > > > > corosync/pacemaker, the whole cluster came back to normal.
> > > > >
> > > > > Active Internet connections (only servers)
> > > > > Proto Recv-Q Send-Q Local Address       Foreign Address State  PID/Program name
> > > > > udp   320256      0 192.168.100.67:5434 0.0.0.0:*              18637/corosync
> > > > >
> > > > > Udp:
> > > > >     539832 packets received
> > > > >     619 packets to unknown port received.
> > > > >     407249 packet receive errors
> > > > >     1007262 packets sent
> > > > >     RcvbufErrors: 69940
> > > > >
> > > > > So I am asking whether there is any corosync bug/issue that
> > > > > could cause it to receive packets from the socket slowly and
> > > > > hang for some reason?
> > > > >
> > > > > Thanks a lot, looking forward to your response.
> > > > >
> > > > > Best Regards.
> > > > >
> > > > > Hui.

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss