On 09/02/18 09:05, Janne Johansson wrote:
2018-02-08 23:38 GMT+01:00 Simon Ironside <sironside@xxxxxxxxxxxxx>:
Hi Everyone,
I recently added an OSD to an active+clean Jewel (10.2.3) cluster
and was surprised to see a peak of 23% objects degraded. Surely this
should be at or near zero and the objects should show as misplaced?
I've searched and found Chad William Seys' thread from 2015 but
didn't see any conclusion that explains this:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-July/003355.html
I agree. I've always viewed it like this: you have three copies of a
PG, you add a new OSD, and CRUSH decides one of those copies should now
live on the new OSD instead of on one of the three older ones. The
cluster then simply stops caring about the old copy, creates a new,
empty PG on the new OSD, and while the sync towards that new PG is
running it is "behind" in the data it contains, even though it (and the
two copies that stayed put) are correctly placed for the new CRUSH map.
Misplaced would probably be a more natural way of reporting it, at
least if the now-abandoned copy were still being updated while the sync
runs, but I don't think it is; it gets orphaned rather quickly once the
new OSD kicks in.
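For what it's worth, you can watch the distinction while the backfill
runs; a rough sketch with the stock ceph CLI (the exact wording of the
PG states varies a bit between releases):

    # cluster-wide degraded / misplaced object counts
    ceph -s
    # per-PG view: degraded PGs vs. ones merely remapped/backfilling
    ceph pg dump pgs_brief | egrep 'degraded|backfill|remapped'
    # and where the data is actually moving
    ceph osd df tree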
I guess this design choice boils down to "being able to handle someone
adding more OSDs to a cluster that is close to full", at the expense of
"discarding one or more of the old copies and scaring the admin as if
there were a huge problem, when all they did was add one or more shiny
new OSDs".
It certainly does scare me, especially as this particular cluster is
size=2, min_size=1.
My worry is that I could experience a disk failure while adding a new
OSD and potentially lose data, whereas if the same disk failed while
the cluster was active+clean I wouldn't. That doesn't seem like a very
safe design choice, but perhaps the real answer is to use size=3.
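If I do go down that route I assume it's just the usual pool settings,
something like the below (<poolname> is a placeholder, and I realise it
will kick off a fair amount of backfill):

    # keep 3 copies, refuse I/O once fewer than 2 are available
    ceph osd pool set <poolname> size 3
    ceph osd pool set <poolname> min_size 2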
Reweighting an active OSD to 0 does the same thing on my cluster: the
objects go degraded instead of showing as misplaced, as I'd expect them
to.
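By reweighting to 0 I mean something along the lines of one of the
usual reweight commands, e.g.:

    # osd.12 / 12 are just example IDs here
    ceph osd crush reweight osd.12 0
    ceph osd reweight 12 0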
Thanks,
Simon.