Re: cosd multi-second stalls cause "wrongly marked me down"

On Thu, 3 Mar 2011, Jim Schutt wrote:
> 
> On Wed, 2011-03-02 at 22:03 -0700, Sage Weil wrote:
> > > I'm not sure how to track down what's happening here...
> > 
> > Hmm.  I'm not able to reproduce this here (tho I only have ~15 nodes 
> > available at the moment).  Seeing the last bit of the logs on the crashed 
> > nodes will help.
> > 

Can you confirm that the chdir is working now?  Maybe put an assert(0) in 
tick() so we can verify core dumps are working in general?

Also, can you confirm that there's nothing interesting in dmesg on these 
nodes (like OOM)?

Thanks-
sage


> 
> So this might be interesting.  In my last email, osd.15.log ended with
> 
> 2011-03-03 08:35:29.933436 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).do_sendmsg short write did 195207, still have 91335
> 
> 
> It occurred to me you might like to know what thread
> 7fb3d545c940 was doing when it got that short write:
> 
> # grep 7fb3d545c940 osd.15.log | tail
> 2011-03-03 08:32:33.108190 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).writer encoding 45 0x7fb3c4ad6970 pg_stats(1228 pgs v 6) v1
> 2011-03-03 08:32:33.114972 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).writer sending 45 0x7fb3c4ad6970
> 2011-03-03 08:32:33.115001 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).write_message 0x7fb3c4ad6970
> 2011-03-03 08:34:01.154979 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).writer: state = 2 policy.server=0
> 2011-03-03 08:34:01.154991 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).write_keepalive
> 2011-03-03 08:34:01.155010 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).write_ack 29
> 2011-03-03 08:34:01.155041 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).writer encoding 46 0x7fb3c4b9fd90 pg_stats(1228 pgs v 6) v1
> 2011-03-03 08:34:01.163035 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).writer sending 46 0x7fb3c4b9fd90
> 2011-03-03 08:34:01.163069 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).write_message 0x7fb3c4b9fd90
> 2011-03-03 08:35:29.933436 7fb3d545c940 -- 172.17.40.22:6821/27793 >> 172.17.40.34:6789/0 pipe(0x7fb3c4001270 sd=12 pgs=2580 cs=1 l=1).do_sendmsg short write did 195207, still have 91335
> 
> I assume this means the short write happened on sending
> pg_stats? 172.17.40.34 is where my monitor is running.
> 
> -- Jim
> 
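
In the excerpt above, the writer thread goes quiet for roughly 88 seconds after each write_message entry (08:32:33.115 to 08:34:01.154, then 08:34:01.163 to 08:35:29.933). A minimal sketch for finding such gaps automatically, assuming each log line begins with a date, a time with fractional seconds, and a thread id, as in the lines quoted here. The function name find_stalls and the one-second default threshold are illustrative choices, not part of Ceph.

```python
from datetime import datetime

# Assumed layout, taken from the quoted log lines:
#   "<YYYY-MM-DD> <HH:MM:SS.ffffff> <thread id> <rest of message>"
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def find_stalls(lines, thread_id, min_gap_seconds=1.0):
    """Return (gap_seconds, previous_line, next_line) for each pair of
    consecutive entries from thread_id separated by at least
    min_gap_seconds."""
    stalls = []
    prev_time = None
    prev_line = None
    for line in lines:
        fields = line.split(None, 3)
        # Skip lines that are too short or belong to another thread.
        if len(fields) < 3 or fields[2] != thread_id:
            continue
        try:
            stamp = datetime.strptime(fields[0] + " " + fields[1], TIME_FORMAT)
        except ValueError:
            # Not a timestamped log line; ignore it.
            continue
        if prev_time is not None:
            gap = (stamp - prev_time).total_seconds()
            if gap >= min_gap_seconds:
                stalls.append((gap, prev_line, line))
        prev_time, prev_line = stamp, line
    return stalls


def report(path, thread_id, min_gap_seconds=1.0):
    """Print each stall found for thread_id in the log file at path."""
    with open(path) as log:
        for gap, before, after in find_stalls(log, thread_id, min_gap_seconds):
            print("%.3f s gap" % gap)
            print("  before: " + before.rstrip())
            print("  after:  " + after.rstrip())
```

For the log discussed here, this could be run as report("osd.15.log", "7fb3d545c940"). The line just before each reported gap shows what the thread was doing when it stalled.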
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 