Re: snap_trimming + backfilling is inefficient with many purged_snaps

Hi Greg,

sorry, this somehow got stuck in my drafts folder.

On Tue, Sep 23, 2014 at 10:00 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
> On Tue, Sep 23, 2014 at 6:20 AM, Florian Haas <florian@xxxxxxxxxxx> wrote:
>> On Mon, Sep 22, 2014 at 7:06 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:
>>> On Sun, Sep 21, 2014 at 9:52 PM, Sage Weil <sweil@xxxxxxxxxx> wrote:
>>>> On Sun, 21 Sep 2014, Florian Haas wrote:
>>>>> So yes, I think your patch absolutely still has merit, as would any
>>>>> means of reducing the number of snapshots an OSD will trim in one go.
>>>>> As it is, the situation looks really really bad, specifically
>>>>> considering that RBD and RADOS are meant to be super rock solid, as
>>>>> opposed to say CephFS which is in an experimental state. And contrary
>>>>> to CephFS snapshots, I can't recall any documentation saying that RBD
>>>>> snapshots will break your system.
>>>>
>>>> Yeah, it sounds like a separate issue, and no, the limit is not
>>>> documented because it's definitely not the intended behavior. :)
>>>>
>>>> ...and I see you already have a log attached to #9503.  Will take a look.
>>>
>>> I've already updated that issue in Redmine, but for the list archives
>>> I should also add this here: Dan's patch for #9503, together with
>>> Sage's for #9487, makes the problem go away in an instant. I've
>>> already pointed out that I owe Dan dinner, and Sage, well I already
>>> owe Sage pretty much lifelong full board. :)
>>
>> Looks like I was a bit too eager: while the cluster behaves nicely
>> with these patches as long as nothing happens to any OSDs, it does
>> flag PGs as incomplete when an OSD goes down. Once the mon osd down
>> out interval expires, things seem to recover/backfill normally, but
>> it's still disturbing to see this in the interim.
>>
>> I've updated http://tracker.ceph.com/issues/9503 with a pg query from
>> one of the affected PGs, within the mon osd down out interval, while
>> it was marked incomplete.
>>
>> Dan or Sage, any ideas as to what might be causing this?
>
> That *looks* like it's just because the pool has both size and
> min_size set to 2?

Correct. But the documentation doesn't reflect that this is a
perfectly expected side effect of setting min_size > 1.
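
For the archives, checking (and, if the tradeoff is acceptable,
loosening) those pool settings is straightforward; a minimal sketch,
with "rbd" standing in for whichever pool is affected:

  # Show the pool's replica count and the minimum number of
  # replicas that must be available for I/O to proceed
  ceph osd pool get rbd size
  ceph osd pool get rbd min_size

  # Allow I/O to continue with a single surviving replica
  # (trades a window of reduced redundancy for availability)
  ceph osd pool set rbd min_size 1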

pg-states.rst says:

*Incomplete*
  Ceph detects that a placement group is missing a necessary period of history
  from its log.  If you see this state, report a bug, and try to start any
  failed OSDs that may contain the needed information.

So if min_size > 1 and the number of surviving replicas drops below
min_size, then the incomplete state is not a bug but a perfectly
expected occurrence, correct?
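
One way to check that correlation on a live cluster (a sketch; 3.ef
is a hypothetical PG ID, substitute one of the affected ones):

  # List PGs stuck inactive; the incomplete ones show up here
  ceph pg dump_stuck inactive

  # Compare the PG's surviving acting set against the pool's
  # min_size
  ceph pg 3.ef query | grep -A 5 '"acting"'
  ceph osd pool get rbd min_size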

It's still a bit weird that the PG seems to behave differently
depending on min_size. If min_size == 1 (the default), a PG with no
remaining replicas goes "stale". The exception is when a replica
fails first, the primary takes writes and then also fails, and the
old replica comes back up: it can't become primary because its data
is now outdated, so the PG goes "down". Either way, the PG never
goes "incomplete".
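
If anyone wants to reproduce the difference on a test cluster,
something like this should do it (a sketch; how you stop an OSD
daemon depends on your init system):

  # Create a pool where losing one of two replicas blocks I/O
  ceph osd pool create minsize-test 64 64
  ceph osd pool set minsize-test size 2
  ceph osd pool set minsize-test min_size 2

  # Stop a single OSD daemon, then watch the pool's PGs before
  # the mon osd down out interval expires; on this cluster they
  # get flagged incomplete rather than merely degraded
  ceph pg dump_stuck inactive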

So is the documentation wrong, or is there something fishy with the
reported state of the PGs?

Cheers,
Florian