Re: wip-pg-stale

Gregory Farnum <gregory.farnum@xxxxxxxxxxxxx> · Fri, 27 Jan 2012 14:11:49 -0800



On Fri, Jan 27, 2012 at 1:32 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> Please review.
>
> If the monitor sees an osdmap go by where nodes go down (or up) it will
> scan its pg_map and mark any pg whose primary is down as 'stale'.  If/when
> the pg recovers, that will get refreshed.  If not, the admin will know
> something is up.
Hmm. Without any kind of timeout this flag will get set every time an
OSD goes down: the replicas won't alert the new primary until after
they get the map marking their old primary down, and this check will
be run synchronously with the generation of the map marking the OSD
down.
The "spurious" stale marker on each PG isn't a big deal (it'll
disappear after a few seconds), but if we're going to set HEALTH_WARN
based on it, that seems like a bit much to me.
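
To make the concern concrete, here is a rough sketch of the scan as I
understand it from the description above. This is not the monitor
code; PGStat, mark_stale, and the dict-based pg_map are made-up names
for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class PGStat:
    # Hypothetical stand-in for a pg_map entry, not the real structure.
    primary: int                              # OSD id of the primary
    state: set = field(default_factory=set)   # e.g. {"active", "clean"}


def mark_stale(pg_map, down_osds):
    """Flag every PG whose primary is in down_osds; return flagged pgids."""
    flagged = []
    for pgid, stat in pg_map.items():
        if stat.primary in down_osds and "stale" not in stat.state:
            stat.state.add("stale")
            flagged.append(pgid)
    return flagged


pg_map = {
    "0.1": PGStat(primary=0, state={"active", "clean"}),
    "0.2": PGStat(primary=1, state={"active", "clean"}),
    "0.3": PGStat(primary=0, state={"active", "clean"}),
}

# Runs at the moment the map marking osd 0 down is generated, so every
# PG led by osd 0 is flagged before any new primary could have reported.
flagged = mark_stale(pg_map, down_osds={0})
```

Since nothing here looks at how long a PG has been stale, a health
check keyed on the bare flag would fire on every OSD failure.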

> We'll soon be adding the last_active, last_clean, and now last_unstale (?)
> fields so that bigger alarms can go off when the pg stays stale for more
> than a few seconds...
Yeah; I think we want to use the stale flag to trigger big warnings
once those timestamps exist, but not to trigger any warnings without
them.
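
Something along these lines is what I have in mind, as a sketch only.
It assumes a last_unstale timestamp per stale PG (the field proposed
above); stale_health, the dict layout, and the 30-second grace value
are all invented for illustration.

```python
def stale_health(last_unstale, now, grace=30.0):
    """Warn only for PGs that have stayed stale longer than grace seconds.

    last_unstale maps the pgid of each currently stale PG to the last
    time it was known not to be stale (hypothetical layout).
    """
    stuck = sorted(pgid for pgid, ts in last_unstale.items()
                   if now - ts > grace)
    status = "HEALTH_WARN" if stuck else "HEALTH_OK"
    return status, stuck


# PG 0.1 went stale 5 seconds ago (ordinary failover), 0.3 has been
# stale for 65 seconds.
status, stuck = stale_health({"0.1": 100.0, "0.3": 40.0}, now=105.0)
```

With that, a PG that is only briefly stale during failover never
shows up in health, and only the ones that stay stale raise a warning.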
-Greg


>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html