Re: cosd multi-second stalls cause "wrongly marked me down"

"Jim Schutt" <jaschut@xxxxxxxxxx> · Fri, 11 Mar 2011 15:26:49 -0700

On Fri, 2011-03-11 at 15:21 -0700, Jim Schutt wrote:
> On Fri, 2011-03-11 at 14:37 -0700, Sage Weil wrote:
> > On Fri, 11 Mar 2011, Jim Schutt wrote:
> > > 
> > > On Fri, 2011-03-11 at 11:37 -0700, Sage Weil wrote:
> > > > On Fri, 11 Mar 2011, Jim Schutt wrote:
> > > > > So none of those were osd_ping messages.
> > > > > 
> > > > > But, I still had lots of delayed acks.  Here's a couple more examples:
> > > > > 
> > > > > osd.0.log:960713:2011-03-11 10:55:32.117721 7fc4cb7fe940 -- 172.17.40.21:6802/28363 --> osd74 172.17.40.31:6808/24916 -- osd_ping(e14 as_of 14) v1 -- ?+0 0x7fc4cc16f270
> > > > > osd.0.log:960756:2011-03-11 10:55:32.118395 7fc4c96eb940 -- 172.17.40.21:6802/28363 >> 172.17.40.31:6808/24916 pipe(0x28fb580 sd=53 pgs=47 cs=1 l=0).writer encoding 289 0x7fc4cc16f270 osd_ping(e14 as_of 14) v1
> > > > > osd.0.log:960757:2011-03-11 10:55:32.118409 7fc4c96eb940 -- 172.17.40.21:6802/28363 >> 172.17.40.31:6808/24916 pipe(0x28fb580 sd=53 pgs=47 cs=1 l=0).writer sending 289 0x7fc4cc16f270
> > > > > osd.0.log:960758:2011-03-11 10:55:32.118422 7fc4c96eb940 -- 172.17.40.21:6802/28363 >> 172.17.40.31:6808/24916 pipe(0x28fb580 sd=53 pgs=47 cs=1 l=0).write_message 0x7fc4cc16f270 osd_ping(e14 as_of 14) v1
> > > > 
> > > > This bit has me confused:
> > > > 
> > > > > osd.0.log:963163:2011-03-11 10:55:32.941413 7fc4c61b6940 -- 172.17.40.21:6802/28363 >> 172.17.40.31:6808/24916 pipe(0x28fb580 sd=53 pgs=47 cs=1 l=0).reader got ack seq 289 >= 289 on 0x7fc4cc16f270 osd_ping(e14 as_of 14) v1
> > > > > osd.0.log:964273:2011-03-11 10:55:33.447526 7fc4c61b6940 RefCountedObject::put delete 0x7fc4cc16f270 took 5.06013 secs!
> > > > > osd.0.log:964274:2011-03-11 10:55:33.447538 7fc4c61b6940 -- 172.17.40.21:6802/28363 >> 172.17.40.31:6808/24916 pipe(0x28fb580 sd=53 pgs=47 cs=1 l=0).handle_ack finished put on 0x7fc4cc16f270
> > > > 
> > > > It looks like ~ .5 seconds has gone by for that thread, but the ::put 
> > > > debug says 5 seconds.  It happens between the 'got ack seq' and 'finished 
> > > > put' lines, though, right?  
> > > 
> > > OK, with Greg's utime_t double fix, I git no hits on
> > > 
> > > # egrep -Hn --color -e "->put took" osd.*.log
> > > 
> > > 
> > > But, I still get lots of stalled :RefCountedObject::put delete:
> > > 
> > > # grep -Hn RefCountedObject::put osd.*.log | egrep "took [1-9]\." | wc -l
> > > 8911
> > > 
> > > # grep -Hn RefCountedObject::put osd.*.log | egrep "took [0-9][0-9]\." | wc -l
> > > 415
> > 
> > Hmm!  That does seem to point at the allocator, doesn't it.
> > 
> > Other threads are doing work during this long interval?  Including freeing 
> > memory, presumably, since basically everything uses the heap one way or 
> > another.  If it's the allocator, it's somehow affecting one thread only, 
> > which is pretty crazy.
> 
> As an example,  consider this period:
> 
> osd.34.log:1083828:2011-03-11 12:38:33.091345 7f8972f6f940 -- 172.17.40.25:6808/8345 >> 172.17.40.30:6814/8599 pipe(0x7f897c186c70 sd=109 pgs=64 cs=1 l=0).reader got ack seq 326 >= 326 on 0x7f897c47cd00 osd_ping(e5 as_of 5) v1
> osd.34.log:1096768:2011-03-11 12:38:57.143061 7f8972f6f940 RefCountedObject::put delete 0x7f897c47cd00 took 23.9321 secs!
> osd.34.log:1096769:2011-03-11 12:38:57.143079 7f8972f6f940 -- 172.17.40.25:6808/8345 >> 172.17.40.30:6814/8599 pipe(0x7f897c186c70 sd=109 pgs=64 cs=1 l=0).handle_ack finished put on 0x7f897c47cd00
> 
> In that period 225 unique threads ran, 10887 lines of logging
> were generated, of which 9985 contained "pipe".  So, lots of
> network activity.  6881 lines of log contained "reader", 
> 1715 lines of log contained "writer".
> 
> And, one other instance of a stalled delete:
> 
> # egrep -Hn -B 1000000 "^2011-03-11 12:38:57.023162" osd.34.log | egrep -A 1000000 "^osd[^ ]*011-03-11 12:38:33.055826" | grep "put delete"
> osd.34.log-1094490-2011-03-11 12:38:57.022896 7f8975c9c940 RefCountedObject::put delete 0x7f897c3484e0 took 16.0578 secs!
> osd.34.log-1094501-2011-03-11 12:38:57.023129 7f8972c6c940 RefCountedObject::put delete 0x7f897c4ca050 took 23.9673 secs!

> 
> 
> > 
> > Is it difficult for you to try this with tcmalloc?  That'll tell us 
> > something.
> 
> I don't think so.  But, I'm out next week so won't get to
> it until Mar 21 :(
> 
> > 
> > One other possibility would be to try to catch this "in the act" and send 
> > it a SIGABRT to get a core dump.  Then we can look in more detail at what 
> > this (and other) threads are up to.  I'm not sure how easy this is to 
> > catch on a particular node...
> 
> So it occurs to me that one call to Message::put() entails many 
> calls to buffer::ptr::release(), depending on what the message 
> is, right?  Maybe time the "delete _raw" in there and assert() 
> if it's too long?

Also, any chance all incoming data is causing buffer_total_alloc
to be contended?  I don't have libatomic_ops either, so that
atomic_t is implemented via a pthread_spinlock_t, right?
How to check that?

-- Jim

> 
> -- Jim
> 
> > 
> > sage
> > 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html