On Wed, 23 Jan 2013, Wido den Hollander wrote:
> On 01/23/2013 01:14 PM, Jens Kristian Søgaard wrote:
> > Hi Sage,
> >
> > > I think the problem now is just that 'osd target transaction size' is
> > > too big (default is 300). Recommended 50.. let's see how that goes.
> > > Even smaller (20 or 25) would probably be fine.
> >
>
> Going through the code and reading that this solved it for Jens, could this
> issue be traced back to less powerful CPUs?
>
> I've seen this on Atom and Fusion platforms, neither of which excels in
> computing power.
>
> From what I read, the OSD by default does 300 transactions and then
> commits them? If the CPU is too slow to handle all the work, timeouts can
> occur because it can't do all the transactions inside the set window?
>
> By lowering the number of transactions it sends out a heartbeat more often,
> thus keeping itself alive.
>
> Correct?

In this case, it controls how many operations we stuff into an atomic
transaction when doing something big (like deleting an entire PG). The
speed is as much about the storage as the CPU, although I'm sure a small
CPU helps slow things down. The thread needs to be able to do those N
unlinks (or whatever) within the heartbeat interval or else the OSD will
consider the thread stuck and zap itself.

I think 300 was just a silly initial value... The default is now either
30 or 50.

sage

>
> Wido
>
> > I set it to 50, and that seems to have solved all my problems.
> >
> > After a day or so my cluster got to a HEALTH_OK state again. It has been
> > running for a few days now without any crashes!
> >
> > Thanks for all your help!
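
To make the batching argument above concrete, here is a rough sketch of why a smaller 'osd target transaction size' helps. It is not Ceph's code: the function names, the 0.1 s per-unlink cost and the 15 s heartbeat interval are all made-up values for illustration. It only shows that a large job split into batches of N operations must finish each batch inside the heartbeat window.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batch_ops(ops: Sequence[T], target_size: int) -> Iterator[List[T]]:
    """Split a big job into batches of at most target_size operations.

    Hypothetical helper: each batch stands in for one atomic transaction.
    """
    if target_size < 1:
        raise ValueError("target_size must be at least 1")
    for start in range(0, len(ops), target_size):
        yield list(ops[start:start + target_size])


def fits_in_heartbeat(target_size: int,
                      seconds_per_op: float,
                      heartbeat_interval: float) -> bool:
    """Return True if a full batch finishes before the heartbeat window ends.

    seconds_per_op is an assumed average cost that depends on both storage
    and CPU; heartbeat_interval is an assumed value, not a Ceph default.
    """
    return target_size * seconds_per_op < heartbeat_interval


if __name__ == "__main__":
    unlinks = list(range(1000))   # stand-in for "deleting an entire PG"
    seconds_per_op = 0.1          # assumed cost on slow hardware
    heartbeat_interval = 15.0     # assumed window, in seconds

    for size in (300, 50):
        batches = list(batch_ops(unlinks, size))
        ok = fits_in_heartbeat(size, seconds_per_op, heartbeat_interval)
        print(f"size={size}: {len(batches)} batches, fits in window: {ok}")
```

With these assumed numbers, a batch of 300 would need about 30 s and miss the window, while a batch of 50 needs about 5 s and stays inside it. The total work is unchanged; only the amount done between heartbeats shrinks.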
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html