Re: Hit suicide timeout after adding new osd

Wido den Hollander <wido@xxxxxxxxx> · Wed, 23 Jan 2013 13:26:33 +0100

On 01/23/2013 01:14 PM, Jens Kristian Søgaard wrote:
> Hi Sage,
>
>> I think the problem now is just that 'osd target transaction size' is
>> too big (default is 300).  Recommended 50.. let's see how that goes.
>> Even smaller (20 or 25) would probably be fine.

Going through the code and reading that this solved it for Jens, could 
this issue be traced back to less powerful CPUs?

I've seen this on Atom and Fusion platforms, neither of which excels in
computing power.

From what I read, the OSD by default batches 300 transactions and then
commits them? If the CPU is too slow to handle all the work, timeouts
can occur because it can't finish all the transactions inside the set
window?

By lowering the number of transactions per batch, the OSD sends out a
heartbeat more often, thus keeping itself alive.

Correct?
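
To make that reasoning concrete, here is a toy model of it. This is not
Ceph code: the function names, the per-transaction cost and the timeout
value are all hypothetical, and it assumes the worker only checks in
once per batch, which is exactly the part I'm asking about.

```python
def longest_gap_between_heartbeats(batch_size, seconds_per_txn):
    """Worst-case time without a check-in: one full batch of transactions."""
    return batch_size * seconds_per_txn


def hits_timeout(batch_size, seconds_per_txn, timeout_seconds):
    """True if a single batch takes longer than the allowed window."""
    gap = longest_gap_between_heartbeats(batch_size, seconds_per_txn)
    return gap > timeout_seconds


# Hypothetical numbers for a slow CPU; not measured on any real cluster.
SECONDS_PER_TXN = 0.5
TIMEOUT_SECONDS = 120.0

if __name__ == "__main__":
    for batch_size in (300, 50, 25):
        gap = longest_gap_between_heartbeats(batch_size, SECONDS_PER_TXN)
        timed_out = hits_timeout(batch_size, SECONDS_PER_TXN, TIMEOUT_SECONDS)
        print(f"batch={batch_size:3d} gap={gap:6.1f}s timeout={timed_out}")
```

With these made-up numbers only the batch of 300 exceeds the window; a
faster CPU (smaller per-transaction cost) would have the same effect as
a smaller batch.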

Wido

> I set it to 50, and that seems to have solved all my problems.
>
> After a day or so my cluster got to a HEALTH_OK state again. It has been
> running for a few days now without any crashes!
>
> Thanks for all your help!
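
For reference, the change described above would look roughly like this
in ceph.conf. The thread only gives the option name and the value 50;
placing it in the [osd] section is an assumption about how it was set.

```ini
[osd]
    ; lower the per-batch transaction target from the default mentioned above
    osd target transaction size = 50
```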
