Re: cosd multi-second stalls cause "wrongly marked me down"

On Thu, 2011-03-03 at 11:04 -0700, Sage Weil wrote:
> On Thu, 3 Mar 2011, Jim Schutt wrote:
> > 
> > On Wed, 2011-03-02 at 22:03 -0700, Sage Weil wrote:
> > > > I'm not sure how to track down what's happening here...
> > > 
> > > Hmm.  I'm not able to reproduce this here (tho I only have ~15 nodes 
> > > available at the moment).  Seeing the last bit of the logs on the crashed 
> > > nodes will help.
> > > 
> 
> Can you confirm that the chdir is working now?  Maybe put an assert(0) in 
> tick() so we can verify core dumps are working in general?

Great idea, and chdir is definitely working; got 96 core 
files as expected.

> 
> Also, can you confirm that there's nothing interesting in dmesg on these 
> nodes (like OOM)?

The only thing even remotely interesting is the occasional
btrfs message such as:
  [ 7778.199273] btrfs: unlinked 1 orphans
  [69347.002760] btrfs: truncated 1 orphans

Otherwise, no kernel stack traces of the sort I'm
used to seeing; 'dmesg | egrep -i "oom|mem|btrfs"'
only shows those orphan messages.
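
For anyone who wants to run the same check over saved dmesg output from
many nodes, here is a minimal sketch of that filter in Python. The
function name and the sample list are made up for illustration; the
first two sample lines are the orphan messages above, and the third is
an invented unrelated line to show what gets filtered out.

```python
import re

# Same pattern as the egrep above; IGNORECASE mirrors the -i flag.
PATTERN = re.compile(r"oom|mem|btrfs", re.IGNORECASE)


def interesting_lines(lines):
    """Return the lines that the egrep -i filter would have printed."""
    return [line for line in lines if PATTERN.search(line)]


SAMPLE_DMESG = [
    "[ 7778.199273] btrfs: unlinked 1 orphans",
    "[69347.002760] btrfs: truncated 1 orphans",
    "[   12.000000] eth0: link up",  # hypothetical line, not from my nodes
]

if __name__ == "__main__":
    for line in interesting_lines(SAMPLE_DMESG):
        print(line)
```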

-- Jim

> 
> Thanks-
> sage
> 
> 
> > 
> > So this might be interesting.  In my last email, osd.15.log ended with
> > 
> > 2011-03-03 08:35:29.933436 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).do_sendmsg short write did 195207, still have 91335
> > 
> > 
> > It occurred to me you might like to know what thread
> > 7fb3d545c940 was doing when it got that short write:
> > 
> > # grep 7fb3d545c940 osd.15.log | tail
> > 2011-03-03 08:32:33.108190 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).writer encoding 45 0x7fb3c4ad6970 pg_stats(1228 pgs v 6) v1
> > 2011-03-03 08:32:33.114972 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).writer sending 45 0x7fb3c4ad6970
> > 2011-03-03 08:32:33.115001 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).write_message 0x7fb3c4ad6970
> > 2011-03-03 08:34:01.154979 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).writer: state = 2 policy.server=0
> > 2011-03-03 08:34:01.154991 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).write_keepalive
> > 2011-03-03 08:34:01.155010 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).write_ack 29
> > 2011-03-03 08:34:01.155041 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).writer encoding 46 0x7fb3c4b9fd90 pg_stats(1228 pgs v 6) v1
> > 2011-03-03 08:34:01.163035 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).writer sending 46 0x7fb3c4b9fd90
> > 2011-03-03 08:34:01.163069 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).write_message 0x7fb3c4b9fd90
> > 2011-03-03 08:35:29.933436 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).do_sendmsg short write did 195207, still have 91335
> > 
> > I assume this means the short write happened on sending
> > pg_stats? 172.17.40.34 is where my monitor is running.
> > 
> > -- Jim
> > 
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> 
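
In the grep output quoted above, each write_message line for thread
7fb3d545c940 is followed by roughly 88 seconds of silence from that
thread before its next log line. A rough sketch of how such gaps could
be pulled out of a log automatically is below. It assumes only the
layout visible in the excerpt (date, time, thread id as the first three
whitespace-separated fields); the function name, the threshold, and the
shortened sample lines are illustrative, not anything from the Ceph tree.

```python
from datetime import datetime


def find_stalls(lines, thread_id, threshold_s=1.0):
    """Return (gap_seconds, earlier_line, later_line) for each pair of
    consecutive log lines from thread_id that are more than threshold_s
    apart."""
    stalls = []
    prev = None
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[2] != thread_id:
            continue
        try:
            ts = datetime.strptime(" ".join(fields[:2]),
                                   "%Y-%m-%d %H:%M:%S.%f")
        except ValueError:
            continue  # not a timestamped log line
        if prev is not None:
            gap = (ts - prev[0]).total_seconds()
            if gap > threshold_s:
                stalls.append((gap, prev[1], line))
        prev = (ts, line)
    return stalls


# Timestamps and thread id are from the osd.15.log excerpt; the rest of
# each line is shortened to the trailing event name for readability.
SAMPLE_LOG = [
    "2011-03-03 08:32:33.114972 7fb3d545c940 writer sending 45",
    "2011-03-03 08:32:33.115001 7fb3d545c940 write_message",
    "2011-03-03 08:34:01.154979 7fb3d545c940 writer: state = 2",
    "2011-03-03 08:34:01.163069 7fb3d545c940 write_message",
    "2011-03-03 08:35:29.933436 7fb3d545c940 short write",
]

if __name__ == "__main__":
    for gap, before, after in find_stalls(SAMPLE_LOG, "7fb3d545c940"):
        print(f"{gap:.3f}s stall after: {before}")
```

Run against a full log, this would list every place where a given thread
went quiet for longer than the threshold, which may help line up the
stalls with what other threads were doing at the time.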

