>> After looking at the OSDs, we saw a very high load on them (a load of 450), and some were down.
>> "ceph -s" showed that we had down PGs, peering+down PGs, remapped PGs, etc.
>
> Could you tell us a bit more?
> When the load was 450, was this mainly due to disk I/O wait? Did the machines start to swap?

All disks were 100% busy, and the server was swapping.

> Could it be that the swapping was actually causing the machines to die even more?

Although an OSD can run with 100 MB of memory, its memory usage can grow quite fast during recovery. Is there a way to estimate the memory needed?

So basically the cluster was under load because it was recovering, but because it was under load, recovery could not complete.

> FileStore aborts indicate that it couldn't get the work done quickly enough. I've seen this with btrfs, but you say you are using XFS.
>
> You say you are storing small files. What exactly is "small"?

On average, 120 KB.

--
Yann ROBIN
www.YouScribe.com
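For reference, recovery pressure can be throttled per OSD via ceph.conf; a minimal sketch, with illustrative values that were not tested on the cluster described above:

    [osd]
        ; limit concurrent backfill operations per OSD
        osd max backfills = 1
        ; limit in-flight recovery ops per OSD, which bounds recovery memory
        osd recovery max active = 1
        ; give client I/O priority over recovery I/O
        osd recovery op priority = 1

Lower values make recovery slower but keep disk queues and OSD memory growth bounded, which may help avoid the recovering-under-load spiral described above.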