Re: The stability of OSD

Sage Weil <sage@xxxxxxxxxxxx> · Sun, 27 Mar 2011 21:37:47 -0700 (PDT)

On Mon, 28 Mar 2011, Sylar Shen wrote:
> Hi,
> I set up an environment of 20 servers: 2 MDSs, 3 MONs and 18 OSDs
> (the 3 monitors run on OSD hosts).
> My version is 0.24.3 and the OS is Fedora 14.
> I ran into a problem while doing write tests.
> Whether or not I was writing data, some OSDs were randomly marked
> down and out, one by one, after a period of time.
> When that happened, overall performance quickly got worse and worse.
> I checked /var/log/ceph/osd.log but found nothing.
> Has anyone else seen the same problem?
> Or maybe it's just a problem with my hardware......><

Hi Sylar,

This is/was a known problem.  There's a long thread from a couple weeks 
back with Jim Schutt debugging the issue.  We've fixed a few different 
things that have significantly improved the situation, but the heartbeats 
are still failing from time to time.

I suspect using a more recent release will be sufficient at your scale, 
either 0.25.2 or the latest 'next' branch from git (there are autobuilt 
debs for that too).  You can also increase the 'osd heartbeat grace'
setting to make the system less sensitive to the transient hangs that are
preventing the heartbeats from going out.
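
As a rough sketch of that second suggestion, the grace period could be
raised in ceph.conf along these lines.  This assumes the option is set in
the [osd] section, and the value 40 (seconds) is purely illustrative, not a
tested recommendation; choose it based on how long the hangs actually last.

```ini
[osd]
        ; Assumed example value: tolerate longer gaps between heartbeats
        ; before a peer OSD is reported as failed.
        osd heartbeat grace = 40
```

The OSDs would need to be restarted to pick up the change.  A larger grace
also means a genuinely dead OSD takes longer to be marked down, so this is
a trade-off rather than a fix.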

Please let us know what you find, either here or on #ceph.

Thanks!
sage
