Re: degraded PGs when adding OSDs

On Mon, Feb 12, 2018 at 8:51 AM, Simon Ironside <sironside@xxxxxxxxxxxxx> wrote:
> On 09/02/18 09:05, Janne Johansson wrote:
>>
>> 2018-02-08 23:38 GMT+01:00 Simon Ironside <sironside@xxxxxxxxxxxxx>:
>>
>>     Hi Everyone,
>>     I recently added an OSD to an active+clean Jewel (10.2.3) cluster
>>     and was surprised to see a peak of 23% objects degraded. Surely this
>>     should be at or near zero and the objects should show as misplaced?
>>     I've searched and found Chad William Seys' thread from 2015 but
>>     didn't see any conclusion that explains this:
>>
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003355.html
>>
>>   I agree. I have always viewed it like this: you have three copies of a
>> PG, you add a new OSD, and CRUSH decides that one of those copies should
>> now live on the new OSD instead of on one of the three older ones. The
>> cluster simply stops caring about the old copy, creates a new, empty copy
>> on the new OSD, and while data syncs towards it that copy is "behind" until
>> the sync finishes, even though it (and the two remaining copies) are
>> correctly placed for the new CRUSH map. "Misplaced" would probably be a
>> more natural way of reporting it, at least if the now-abandoned copy were
>> still being updated while the sync runs, but I don't think it is. It gets
>> orphaned rather quickly once the new OSD kicks in.
>>
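That matches my understanding of it. If it helps to watch the accounting
while the backfill runs, below is a rough Python sketch that polls the
cluster status and prints the two counters side by side. It assumes the
"ceph" CLI is on the PATH and that the JSON status output exposes pgmap
fields named degraded_objects and misplaced_objects; the exact field names
can differ between releases, so treat it as best-effort rather than gospel.

import json
import subprocess
import time

def pg_counters():
    # "ceph -s --format json" returns the cluster status as JSON; the pgmap
    # section carries the degraded/misplaced object counts (field names
    # assumed here, check your release).
    status = json.loads(
        subprocess.check_output(["ceph", "-s", "--format", "json"]).decode())
    pgmap = status.get("pgmap", {})
    return pgmap.get("degraded_objects", 0), pgmap.get("misplaced_objects", 0)

while True:
    degraded, misplaced = pg_counters()
    print("degraded=%d misplaced=%d" % (degraded, misplaced))
    time.sleep(10)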
>> I guess this design choice boils down to "being able to handle someone
>> adding more OSDs to a cluster that is close to full", at the expense of
>> "discarding one or more of the old copies and scaring the admin as if there
>> were a huge problem when all they did was add one or more shiny new OSDs".
>
>
> It certainly does scare me, especially as this particular cluster is size=2,
> min_size=1.
>
> My worry is that I could experience a disk failure while adding a new OSD
> and potentially lose data

You've already indicated you are willing to accept data loss by
configuring size=2, min_size=1.

Search for "2x replication: A BIG warning"

> while if the same disk failed when the cluster was active+clean I wouldn't.
> That doesn't seem like a very safe design choice, but perhaps the real
> answer is to use size=3.
>
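If you do go to size=3 (and min_size=2), you can raise it on the existing
pools and let the cluster rebalance before the next expansion. A rough sketch
of that, assuming the pools listed by "ceph osd pool ls" are all ones you
want to touch and that you have the capacity for the third copy:

import subprocess

def ceph(*args):
    return subprocess.check_output(("ceph",) + args).decode()

# Bump every pool to 3 copies, with 2 required to serve I/O. Each "pool set"
# kicks off backfill while the third copy is created, so expect recovery
# traffic.
for pool in ceph("osd", "pool", "ls").split():
    subprocess.check_call(["ceph", "osd", "pool", "set", pool, "size", "3"])
    subprocess.check_call(["ceph", "osd", "pool", "set", pool, "min_size", "2"])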
> Reweighting an active OSD to 0 does the same thing on my cluster: it causes
> the objects to go degraded rather than misplaced, which is what I'd expect
> them to be reported as.
>
>
> Thanks,
> Simon.
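On the reweighting point: it doesn't change how the objects are classified
(as you've seen, they still show up as degraded while the new copies are
built), but bringing a new OSD in at a low crush weight and stepping it up
gradually does limit how much data is at reduced redundancy at any one
moment. A rough sketch of that approach; osd.12, the 2.0 target weight and
the 0.2 step are made-up values to substitute with your own:

import subprocess
import time

OSD = "osd.12"   # hypothetical new OSD
TARGET = 2.0     # hypothetical final crush weight (usually the disk size in TiB)
STEP = 0.2

def healthy():
    # Plain "ceph health" output starts with HEALTH_OK once recovery is done.
    return subprocess.check_output(["ceph", "health"]).decode().startswith("HEALTH_OK")

weight = 0.0
while weight < TARGET:
    weight = round(min(weight + STEP, TARGET), 2)
    subprocess.check_call(["ceph", "osd", "crush", "reweight", OSD, str(weight)])
    # Let each step's backfill finish before moving more data.
    while not healthy():
        time.sleep(30)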



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


