RE: HEALTH_ERR when (re)starting ceph-osd's

"Piotr.Dalek@xxxxxxxxxxxxxx" <Piotr.Dalek@xxxxxxxxxxxxxx> · Thu, 28 Jan 2016 13:05:10 +0000

> -----Original Message-----
> From: Wido den Hollander [mailto:wido@xxxxxxxx]
> Sent: Thursday, January 28, 2016 1:38 PM
> 
> On 28-01-16 11:48, Piotr.Dalek@xxxxxxxxxxxxxx wrote:
> > Hello,
> >
> > I haven't noticed it before, but since the merge of
> > https://github.com/ceph/ceph/pull/7253 I see that restarting daemons
> > on a healthy Ceph cluster puts it into HEALTH_ERR state with
> > "$(random_number) pgs are stuck inactive for more than 300 seconds".
> > I looked at the commit and it turns out this will always occur on
> > restart/boot, as booting PGs are inactive "by default" (since the mons
> > never received any sign of life from them), not because they're
> > actually stuck inactive.
> 
> Well, in that case, isn't the PR correct? But I see what you mean.

Actually, the only thing wrong with it is that it reports PGs as having been inactive for a prolonged period of time when that's not true.
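
To illustrate the false positive, here is a rough sketch in Python (not Ceph code). `PgStat`, `is_stuck_inactive` and the zero default for the timestamp are assumptions made for the example; only the 300-second threshold comes from the health message quoted above.

```python
from dataclasses import dataclass

# Seconds; matches the "more than 300 seconds" in the health message.
MON_PG_STUCK_THRESHOLD = 300.0


@dataclass
class PgStat:
    # Hypothetical stand-in for a PG stats record.
    # Assumption: 0.0 means "the monitor never saw this PG active".
    last_active: float = 0.0


def is_stuck_inactive(stat: PgStat, now: float) -> bool:
    # A PG counts as stuck when its last-active stamp is older than the threshold.
    return now - stat.last_active > MON_PG_STUCK_THRESHOLD


now = 1_453_986_310.0  # arbitrary wall-clock time, seconds since the epoch
fresh = PgStat()       # PG that has just booted, nothing reported yet

# The unset stamp looks decades old, so the PG is flagged immediately.
print(is_stuck_inactive(fresh, now))  # True
```

The PG has existed for zero seconds from the monitor's point of view, yet the age computed from the unset stamp is far above the threshold.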

> > One solution to this would be to set the pg_stat.last_* fields to the
> > time the PG was first seen, so PGs become stuck mon_pg_stuck_threshold
> > seconds after first registering, and not right away.
> 
> That sounds like a good solution, you might want to take a look at:
> http://tracker.ceph.com/issues/14028

I'll take a look. Maybe we could fix two issues with one PR ;-)
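
For the record, the "first seen" idea could look roughly like the Python sketch below. This is only an illustration of the approach, not the actual monitor code: `PgStat`, `register_pg` and `is_stuck_inactive` are hypothetical names, and treating 0.0 as "unset" is an assumption.

```python
from dataclasses import dataclass

# Seconds; stands in for mon_pg_stuck_threshold.
MON_PG_STUCK_THRESHOLD = 300.0


@dataclass
class PgStat:
    # Hypothetical stand-in for one of the pg_stat.last_* fields.
    # Assumption: 0.0 means the field has never been set.
    last_active: float = 0.0


def register_pg(stat: PgStat, now: float) -> PgStat:
    # Stamp an unset field with the time the PG was first seen,
    # so the stuck timer starts at registration instead of at the epoch.
    if stat.last_active == 0.0:
        stat.last_active = now
    return stat


def is_stuck_inactive(stat: PgStat, now: float) -> bool:
    # Stuck only once the stamp is older than the threshold.
    return now - stat.last_active > MON_PG_STUCK_THRESHOLD


boot = 1000.0
pg = register_pg(PgStat(), boot)

print(is_stuck_inactive(pg, boot + 10))   # False: only just registered
print(is_stuck_inactive(pg, boot + 301))  # True: no activity for over 300 s
```

With this, a PG that never reports in still ends up flagged, just mon_pg_stuck_threshold seconds after registration and not right away.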

With best regards / Pozdrawiam
Piotr Dałek
