On 09/17/2018 04:33 PM, Gregory Farnum wrote:
On Mon, Sep 17, 2018 at 8:21 AM Graham Allan <gta@xxxxxxx> wrote:
Looking back through history it seems that I *did* override the min_size
for this pool; however, I didn't reduce it - it used to have min_size 2!
That made no sense to me - I think it must be an artifact of a very early
(hammer?) EC pool creation, but it pre-dates me.
I found the documentation on what min_size should be a bit confusing,
which is how I arrived at 4. Fully agree that k+1=5 makes way more
sense.
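(For reference, I believe the profile and current setting can be checked
with something like the following - the pool and profile names here are
just illustrative, not our actual ones:

    ceph osd pool get ec-pool erasure_code_profile
    ceph osd erasure-code-profile get ec-profile
    ceph osd pool get ec-pool min_size

With a k=4 profile the erasure-code-profile output should show k=4,
hence k+1=5 for min_size.)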
I don't think I was the only one confused by this though, eg
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026445.html
I suppose the safest thing to do is update min_size->5 right away to
force any size-4 pgs down until they can perform recovery. I can set
force-recovery on these as well...
Mmm, this is embarrassing but that actually doesn't quite work due to
https://github.com/ceph/ceph/pull/24095, which has been on my task list
but at the bottom for a while. :( So if your cluster is stable now I'd
let it clean up and then change the min_size once everything is repaired.
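(When you do get there, it should just be something like the following -
substitute your actual pool name:

    ceph osd pool set <poolname> min_size 5
    ceph osd pool get <poolname> min_size

the second command just to confirm the change took.)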
Thanks for your feedback, Greg. Since I declared the dead OSD lost, the
down PG has become active again and is successfully serving data. The
cluster is considerably more stable now; I've set force-backfill or
force-recovery on any size=4 PGs and can wait for that to complete
before changing anything else.
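For anyone finding this thread later, the commands involved were along
these lines (the OSD id and pg ids below are placeholders rather than
the real ones from our cluster):

    ceph osd lost 123 --yes-i-really-mean-it
    ceph pg force-backfill 70.2f
    ceph pg force-recovery 70.31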
Thanks again,
Graham
--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx