Re: snap_trimming + backfilling is inefficient with many purged_snaps


 



On Thu, Oct 16, 2014 at 2:04 AM, Florian Haas <florian@xxxxxxxxxxx> wrote:
> Hi Greg,
>
> sorry, this somehow got stuck in my drafts folder.
>
> On Tue, Sep 23, 2014 at 10:00 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>> On Tue, Sep 23, 2014 at 6:20 AM, Florian Haas <florian@xxxxxxxxxxx> wrote:
>>> On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:
>>>> On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>>> On Sun, 21 Sep 2014, Florian Haas wrote:
>>>>>> So yes, I think your patch absolutely still has merit, as would any
>>>>>> means of reducing the number of snapshots an OSD will trim in one go.
>>>>>> As it is, the situation looks really, really bad, especially
>>>>>> considering that RBD and RADOS are meant to be super rock solid, as
>>>>>> opposed to, say, CephFS, which is in an experimental state. And unlike
>>>>>> CephFS snapshots, I can't recall any documentation saying that RBD
>>>>>> snapshots will break your system.
>>>>>
>>>>> Yeah, it sounds like a separate issue, and no, the limit is not
>>>>> documented because it's definitely not the intended behavior. :)
>>>>>
>>>>> ...and I see you already have a log attached to #9503.  Will take a look.
>>>>
>>>> I've already updated that issue in Redmine, but for the list archives
>>>> I should also add this here: Dan's patch for #9503, together with
>>>> Sage's for #9487, makes the problem go away in an instant. I've
>>>> already pointed out that I owe Dan dinner, and Sage, well I already
>>>> owe Sage pretty much lifelong full board. :)
>>>
>>> Looks like I was a bit too eager: the cluster behaves nicely with
>>> these patches as long as nothing happens to any OSDs, but it does
>>> flag PGs as incomplete when an OSD goes down. Once the mon osd down
>>> out interval expires, things seem to recover/backfill normally, but
>>> it's still disturbing to see this in the interim.
>>>
>>> I've updated http://tracker.ceph.com/issues/9503 with a pg query from
>>> one of the affected PGs, within the mon osd down out interval, while
>>> it was marked incomplete.
>>>
>>> Dan or Sage, any ideas as to what might be causing this?
>>
>> That *looks* like it's just because the pool has both size and
>> min_size set to 2?
>
> Correct. But the documentation did not reflect that this is a
> perfectly expected side effect of having min_size > 1.
>
> pg-states.rst says:
>
> *Incomplete*
>   Ceph detects that a placement group is missing a necessary period of history
>   from its log.  If you see this state, report a bug, and try to start any
>   failed OSDs that may contain the needed information.
>
> So if min_size > 1 and replicas < min_size, then the incomplete state
> is not a bug but a perfectly expected occurrence, correct?
>
> It's still a bit weird that the PG seems to behave differently
> depending on min_size. If min_size == 1 (the default), a PG with no
> remaining replicas goes stale. The exception is when a replica fails
> first, the primary takes writes and then fails as well, and the
> replica comes back up but can't go primary because its data is now
> outdated; in that case the PG goes "down". It never goes "incomplete".
>
> So is the documentation wrong, or is there something fishy with the
> reported state of the PGs?

I guess the documentation is wrong, although I thought we'd fixed that
particular one. :/ Giant actually distinguishes between these
conditions by adding an "undersized" state to the PG, so it'll be
easier to diagnose.
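
To make the distinction concrete, here's a rough sketch (Python, purely
illustrative; the function and parameter names are made up and this is
not the actual OSD peering code) of how the scenarios discussed in this
thread map onto PG states:

    def pg_availability_state(size, min_size, live_replicas,
                              survivor_has_latest=True):
        """Purely illustrative mapping of the scenarios in this thread
        onto PG states; real OSD peering logic is far more involved."""
        if live_replicas == 0:
            # No OSD left reporting for the PG at all.
            return "stale"
        if not survivor_has_latest:
            # The surviving replica missed writes that went to a now-dead
            # primary, so it cannot take over as primary.
            return "down"
        if live_replicas < min_size:
            # Fewer live replicas than min_size: I/O blocks until recovery.
            # Pre-Giant this surfaced as "incomplete"; Giant reports
            # "undersized" instead.
            return "undersized (shown as incomplete before Giant)"
        if live_replicas < size:
            return "active+degraded"
        return "active+clean"

    # Florian's case: size == min_size == 2, one OSD down.
    print(pg_availability_state(size=2, min_size=2, live_replicas=1))

The practical upshot: with size == min_size, losing a single OSD blocks
I/O on the affected PGs until backfill brings them back up to min_size
replicas.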
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
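
A quick way to double-check a pool's size and min_size from a script,
sketched with the python-rados bindings (the pool name "rbd" is just an
example; this assumes a readable /etc/ceph/ceph.conf and keyring, and
the field names in the returned JSON are from memory):

    import json
    import rados

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        for var in ('size', 'min_size'):
            cmd = json.dumps({'prefix': 'osd pool get', 'pool': 'rbd',
                              'var': var, 'format': 'json'})
            ret, out, errs = cluster.mon_command(cmd, b'')
            if ret != 0:
                raise RuntimeError(errs)
            print(var, '=', json.loads(out)[var])
    finally:
        cluster.shutdown()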



