Re: wip-pg-stale

Sage Weil <sage@xxxxxxxxxxxx> · Fri, 27 Jan 2012 15:43:15 -0800 (PST)

On Fri, 27 Jan 2012, Gregory Farnum wrote:
> On Fri, Jan 27, 2012 at 1:32 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> > Please review.
> >
> > If the monitor sees an osdmap go by where nodes go down (or up) it will
> > scan its pg_map and mark any pg whose primary is down as 'stale'.  If/when
> > the pg recovers, that will get refreshed.  If not, the admin will know
> > something is up.
> Hmm. Without any kind of timeout this flag will get set every time an
> OSD goes down: the replicas won't alert the new primary until after
> they get the map marking their old primary down, and this check will
> be run synchronously with the generation of the map marking the OSD
> down.
> The "spurious" stale marker on each PG isn't a big deal (it'll
> disappear after a few seconds), but if we're going to set HEALTH_WARN
> based on it, that seems like a bit much to me.

My thought is that as soon as we add the timestamps to the state
transitions, it'll only warn once a pg has been stale for a while.  We
already have the same problem with degraded/peering/etc. in the health
checks...
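
For illustration only (this is not code from the wip-pg-stale branch), here
is a minimal Python sketch of the idea discussed in this thread: mark a pg
stale when its primary is down, remember when that first happened, and only
warn once it has stayed stale past a grace period. All names here (PGStat,
stale_since, mark_stale, refresh, stale_warnings, grace) are hypothetical
and do not correspond to the monitor's actual data structures.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class PGStat:
    """Hypothetical per-pg record kept by the monitor."""
    primary: int                         # id of the OSD acting as primary
    stale: bool = False
    stale_since: Optional[float] = None  # time the pg was first marked stale


def mark_stale(pg_map: Dict[str, PGStat], up_osds: Set[int], now: float) -> None:
    """On a new osdmap, flag every pg whose primary is not up.

    The first stale timestamp is preserved across repeated scans.
    """
    for pg in pg_map.values():
        if pg.primary not in up_osds and not pg.stale:
            pg.stale = True
            pg.stale_since = now


def refresh(pg: PGStat, new_primary: int) -> None:
    """A fresh report for the pg clears the stale flag."""
    pg.primary = new_primary
    pg.stale = False
    pg.stale_since = None


def stale_warnings(pg_map: Dict[str, PGStat], now: float, grace: float) -> List[str]:
    """Return ids of pgs that have been stale for at least `grace` seconds."""
    return sorted(
        pgid
        for pgid, pg in pg_map.items()
        if pg.stale and pg.stale_since is not None and now - pg.stale_since >= grace
    )
```

With a grace period, the brief stale window right after an OSD is marked
down would not raise a warning by itself; only pgs that never get refreshed
would show up in stale_warnings().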

sage

> 
> > We'll soon be adding the last_active, last_clean, and now last_unstale (?)
> > fields so that bigger alarms can go off when the pg stays stale for more
> > than a few seconds...
> Yeah; I think we want to use this to trigger big warnings, but not to
> trigger warnings until those timestamps are in place.
> -Greg
> 
> 
> >
> > sage
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html