Re: cosd multi-second stalls cause "wrongly marked me down"

Sage Weil <sage@xxxxxxxxxxxx> · Mon, 11 Apr 2011 16:23:20 -0700 (PDT)

On Mon, 11 Apr 2011, Jim Schutt wrote:
> Jim Schutt wrote:
> > Sage Weil wrote:
> > > On Fri, 8 Apr 2011, Jim Schutt wrote:
> 
> > > > So, in the short term I guess I need to run fewer cosd
> > > > instances per server.
> > > 
> > > There is one other thing to look at, and that's the number of threads used
> > > by each cosd process.  Have you tried setting
> > > 
> > >     osd op threads = 1
> > > 
> > > (or even 0, although I haven't tested that recently).  That will limit the
> > > number of concurrent IOs in flight to the fs.  Setting it to 0 will avoid
> > > using a thread pool at all and will process the IO in the message dispatch
> > > thread (though we haven't tested that recently so there may be issues).
> > 
> > I'll try this 2nd, since it's easy.
> > 
>      osd op threads = 0
> 
> didn't work for me at all - 20 of 96 OSDs aborted almost
> immediately after startup.
> 
>      osd op threads = 1
> 
> didn't work very well either - one of my servers went OOM,
> which hasn't happened since I started using my restricted
> buffering parameters.

Debugging this turned up a refcounting leak that meant PGs were never
freed.  That's fixed (and both osd op threads = 0 and 1 work in my
limited tests), but other PG-lifecycle-related issues may surface now
that refcounting actually works.  :)
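
For anyone following along, here is a toy model of how a leak like this
keeps objects alive. It is only a sketch and is not the cosd code: the
RefCounted class and both handle_op_* functions are made-up names, and
Python stands in for the real C++. One missed put() is enough to keep the
count above zero forever, so the object is never freed.

```python
class RefCounted:
    """Toy manually ref-counted object (hypothetical; not Ceph's PG class)."""

    live = 0  # number of objects created and not yet freed

    def __init__(self, name):
        self.name = name
        self.nref = 1              # the creator holds the first reference
        RefCounted.live += 1

    def get(self):
        """Take an extra reference."""
        self.nref += 1
        return self

    def put(self):
        """Drop a reference; 'free' the object when the count hits zero."""
        assert self.nref > 0, "put() on an already-freed object"
        self.nref -= 1
        if self.nref == 0:
            RefCounted.live -= 1   # stands in for freeing the object


def handle_op_leaky(pg):
    pg.get()      # take a reference for the duration of the op
    # ... do work ...
    # BUG: no matching pg.put(), so the count stays one too high


def handle_op_fixed(pg):
    pg.get()
    try:
        pass      # ... do work ...
    finally:
        pg.put()  # always drop the reference taken above


leaky = RefCounted("pg-a")
handle_op_leaky(leaky)
leaky.put()       # owner drops its reference, but nref is still 1

fixed = RefCounted("pg-b")
handle_op_fixed(fixed)
fixed.put()       # owner drops its reference, nref reaches 0

print(leaky.nref, fixed.nref, RefCounted.live)
```

In this model the leaky object is still counted as live after its owner
lets go of it. If that happens on every op, memory use only grows, which
would be consistent with the OOM reported above.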

sage