lagging peering wq

Sage Weil <sage@xxxxxxxxxxx> · Fri, 25 Jan 2013 09:50:07 -0800 (PST)

Faidon/paravoid's cluster has a bunch of OSDs that are up, but the pg 
queries indicate they are tens of thousands of epochs behind:

      "history": { "epoch_created": 14,
          "last_epoch_started": 88174,
          "last_epoch_clean": 88174,
          "last_epoch_split": 0,
          "same_up_since": 88172,
          "same_interval_since": 88172,
          "same_primary_since": 88172,

(where the current map epoch is 102000 or thereabouts).

I think just restarting all OSDs at once will get him caught up (especially 
with a 'ceph osd set noup' block until they are done processing maps), but 
I wonder if we want an additional check: if any PG falls more than X 
epochs behind, the OSD marks itself down and catches up before coming 
back in...
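
To make the proposed check concrete, here is a rough sketch in Python. It 
is not actual OSD code: PGState, lagging_pgs, should_mark_self_down, the 
max_lag threshold (the "X" above), and the pgids are all hypothetical 
names made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PGState:
    """Hypothetical per-PG record; not a real Ceph structure."""
    pgid: str
    map_epoch: int  # newest OSDMap epoch this PG has processed


def lagging_pgs(pgs, osd_epoch, max_lag):
    """Return (pgid, lag) for every PG more than max_lag epochs behind."""
    result = []
    for pg in pgs:
        lag = osd_epoch - pg.map_epoch
        if lag > max_lag:
            result.append((pg.pgid, lag))
    return result


def should_mark_self_down(pgs, osd_epoch, max_lag):
    """True if any PG is too far behind and the OSD should catch up first."""
    return len(lagging_pgs(pgs, osd_epoch, max_lag)) > 0


# Example using the epochs quoted above; pgids are invented.
pgs = [PGState("3.1f", 88172), PGState("3.20", 101990)]
behind = lagging_pgs(pgs, osd_epoch=102000, max_lag=1000)
```

What X should be is the open question; it needs to be large enough that 
ordinary map churn does not cause OSDs to flap.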

What do you think?

sage