On Mon, 28 Mar 2011, Sylar Shen wrote:
> Hi,
> I set an environment of 20 servers which include 2 MDSs, 3 MONs and 18
> OSDes(3 monitors on 18 OSDes)
> My version is 0.24.3 and OS is Fedora 14.
> There's a problem when I was doing the writing tests.
> Whether I was writing the data or not, some OSDes were randomly marked
> down and out one by one after a period of time.
> And when that happened, the whole performance soon got worse and worse.
> I checked the /var/log/ceph/osd.log but found nothing.
> So I am curious that is there anyone who has the same problem with me?
> Or maybe it's just a problem of my hardware......><

Hi Sylar,

This is/was a known problem. There's a long thread from a couple weeks
back with Jim Schutt debugging the issue. We've fixed a few different
things that have significantly improved the situation, but the
heartbeats are still failing from time to time.

I suspect using a more recent release will be sufficient at your scale,
either 0.25.2 or the latest 'next' branch from git (there are autobuilt
debs for that too).

You can also increase the 'osd heartbeat grace' to make the system less
sensitive to the transient hangs that are preventing the heartbeats from
going out.

Please let us know what you find, either here or on #ceph.

Thanks!
sage
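
For reference, a minimal sketch of the 'osd heartbeat grace' change mentioned above, as it might look in ceph.conf. The [osd] section placement and the value of 60 seconds are assumptions for illustration only, not a tested recommendation; the right value depends on how long the transient hangs last on your hardware.

```ini
; ceph.conf fragment (illustrative; the value is an assumption)
[osd]
        ; seconds without a heartbeat before a peer OSD is reported as failed
        osd heartbeat grace = 60
```

The OSDs would need to be restarted to pick up the new value. A larger grace makes the cluster slower to notice an OSD that has really died, so it trades detection time for fewer spurious down/out markings.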