On Thu, Oct 16, 2014 at 2:04 AM, Florian Haas <florian@xxxxxxxxxxx> wrote:
> Hi Greg,
>
> sorry, this somehow got stuck in my drafts folder.
>
> On Tue, Sep 23, 2014 at 10:00 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>> On Tue, Sep 23, 2014 at 6:20 AM, Florian Haas <florian@xxxxxxxxxxx> wrote:
>>> On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:
>>>> On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>> On Sun, 21 Sep 2014, Florian Haas wrote:
>>>>>> So yes, I think your patch absolutely still has merit, as would any
>>>>>> means of reducing the number of snapshots an OSD will trim in one go.
>>>>>> As it is, the situation looks really, really bad, specifically
>>>>>> considering that RBD and RADOS are meant to be super rock solid, as
>>>>>> opposed to, say, CephFS, which is in an experimental state. And contrary
>>>>>> to CephFS snapshots, I can't recall any documentation saying that RBD
>>>>>> snapshots will break your system.
>>>>>
>>>>> Yeah, it sounds like a separate issue, and no, the limit is not
>>>>> documented because it's definitely not the intended behavior. :)
>>>>>
>>>>> ...and I see you already have a log attached to #9503. Will take a look.
>>>>
>>>> I've already updated that issue in Redmine, but for the list archives
>>>> I should also add this here: Dan's patch for #9503, together with
>>>> Sage's for #9487, makes the problem go away in an instant. I've
>>>> already pointed out that I owe Dan dinner, and Sage, well, I already
>>>> owe Sage pretty much lifelong full board. :)
>>>
>>> Looks like I was a bit too eager: while the cluster behaves nicely
>>> with these patches as long as nothing happens to any OSDs, it does
>>> flag PGs as incomplete when an OSD goes down. Once the mon osd down
>>> out interval expires, things seem to recover/backfill normally, but
>>> it's still disturbing to see this in the interim.
>>>
>>> I've updated http://tracker.ceph.com/issues/9503 with a pg query from
>>> one of the affected PGs, taken within the mon osd down out interval,
>>> while it was marked incomplete.
>>>
>>> Dan or Sage, any ideas as to what might be causing this?
>>
>> That *looks* like it's just because the pool has both size and
>> min_size set to 2?
>
> Correct. But the documentation did not reflect that this is a
> perfectly expected side effect of having min_size > 1.
>
> pg-states.rst says:
>
> *Incomplete*
>   Ceph detects that a placement group is missing a necessary period of
>   history from its log. If you see this state, report a bug, and try
>   to start any failed OSDs that may contain the needed information.
>
> So if min_size > 1 and the number of surviving replicas drops below
> min_size, the incomplete state is not a bug but a perfectly expected
> occurrence, correct?
>
> It's still a bit weird in that the PG seems to behave differently
> depending on min_size. If min_size == 1 (the default), a PG with no
> remaining replicas is marked "stale". The exception is when a replica
> fails first, the primary takes writes and then also fails, and the
> replica comes back up but can't go primary because its data is now
> outdated; in that case the PG goes "down". It never goes "incomplete".
>
> So is the documentation wrong, or is there something fishy with the
> reported state of the PGs?

I guess the documentation is wrong, although I thought we'd fixed that
particular one. :/

Giant actually distinguishes between these conditions by adding an
"undersized" state to the PG, so it'll be easier to diagnose.
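[A minimal, illustrative sketch for spotting pools in the situation
described above (size and min_size both 2, so a single OSD failure
drops a PG below min_size and stalls I/O until recovery restores
enough copies). This is not part of either patch discussed in the
thread; it only assumes a working 'ceph' CLI in PATH, a keyring that
may run 'ceph osd dump', and stock Python.]

#!/usr/bin/env python
# Sketch: list pools whose size/min_size combination cannot tolerate
# even a single OSD failure without PGs dropping below min_size.
import json
import subprocess


def ceph_json(*args):
    """Run a ceph CLI command with JSON output and return the parsed result."""
    out = subprocess.check_output(("ceph",) + args + ("--format", "json"))
    return json.loads(out.decode("utf-8"))


def main():
    for pool in ceph_json("osd", "dump")["pools"]:
        size, min_size = pool["size"], pool["min_size"]
        if size <= min_size:
            # Losing one OSD from a PG's acting set in this pool leaves
            # fewer than min_size replicas, so the PG stops serving I/O
            # until recovery/backfill brings the replica count back up
            # (reported as "undersized" from Giant onwards).
            print("pool %s: size=%d, min_size=%d"
                  % (pool["pool_name"], size, min_size))


if __name__ == "__main__":
    main()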
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com