Re: cosd multi-second stalls cause "wrongly marked me down"

On Fri, 2011-03-11 at 15:45 -0700, Sage Weil wrote:
> On Fri, 11 Mar 2011, Jim Schutt wrote:
> > > So it occurs to me that one call to Message::put() entails many 
> > > calls to buffer::ptr::release(), depending on what the message 
> > > is, right?  Maybe time the "delete _raw" in there and assert() 
> > > if it's too long?
> > 
> > Also, any chance all incoming data is causing buffer_total_alloc
> > to be contended?  I don't have libatomic_ops either, so that
> > atomic_t is implemented via a pthread_spinlock_t, right?
> > How to check that?
> 
> Hmm, it could be.  I pushed a nobuffer branch that compiles out the 
> buffer_total_alloc accounting, if you want to give that a go.

That seems to have helped, although it's not a complete solution.

I still had some OSDs reported as failed, but since I use

        osd min down reporters = 3
        osd min down reports = 2

only 1 OSD got marked down; it noticed quickly and marked
itself up, and my 64-client dd finished.  That's new for
me at 96 OSDs.

I saw this on this run:

# grep -Hn RefCountedObject::put osd.*.log | egrep "took [0-9][0-9]\." | wc -l 
192

# grep -Hn RefCountedObject::put osd.*.log | egrep "took [1-9]\." | wc -l
12578

which compares to a previous run in an earlier email:


> > > # grep -Hn RefCountedObject::put osd.*.log | egrep "took [1-9]\." | wc -l
> > > 8911
> > > 
> > > # grep -Hn RefCountedObject::put osd.*.log | egrep "took [0-9][0-9]\." | wc -l
> > > 415
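
The two greps above can be folded into one pass over the logs. Below is a minimal Python sketch, not the method used for the numbers in this thread. It assumes each relevant line contains RefCountedObject::put and a duration written as "took <seconds>.<fraction>", which is inferred only from the grep patterns; the function name and the bucket labels are hypothetical. One difference to be aware of: the two-digit grep pattern matches only 10 to 99 seconds, while this sketch counts anything of 10 seconds or more.

```python
import re
import sys
from collections import Counter

# Assumed log shape (inferred from the grep patterns, not from Ceph source):
# a line mentioning RefCountedObject::put that contains "took <secs>.<frac>".
TOOK = re.compile(r"took (\d+)\.")


def bucket_put_stalls(lines):
    """Count put() timings of 1-9 s and of 10 s or more in a single pass."""
    counts = Counter()
    for line in lines:
        if "RefCountedObject::put" not in line:
            continue
        match = TOOK.search(line)
        if match is None:
            continue
        whole_seconds = int(match.group(1))
        if whole_seconds >= 10:
            counts["10s+"] += 1
        elif whole_seconds >= 1:
            counts["1-9s"] += 1
        # Durations under one second are ignored, as in the greps.
    return counts


if __name__ == "__main__":
    # Log file paths are taken from the command line, e.g. osd.*.log
    totals = Counter()
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log:
            totals.update(bucket_put_stalls(log))
    if sys.argv[1:]:
        print("1-9s:", totals["1-9s"], "10s+:", totals["10s+"])
```

Saved under a hypothetical name such as put_stalls.py, it would be run as "python3 put_stalls.py osd.*.log" and print both counts on one line, which makes run-to-run comparisons like the one above easier to repeat.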

-- Jim

> 
> sage
> 

