I'm now following up on my earlier message regarding data migration
from old to new hardware in our ceph cluster. As part of this we wanted
to move to device-class-based crush rules. For the replicated pools the
directions were straightforward; for our EC pool it wasn't so clear,
but I proceeded via the outline below: defining a new EC profile and an
associated crush rule, then finally, last Wednesday (5 Sept), updating
the crush rule for the pool.
Data migration started well for the first 36 or so hours, then the
storage backend became very unstable with OSDs dropping out and failing
to peer... capped off by the mons running out of disk space. After
correcting the mon disk issue, stability seemed to be restored by
dropping down max backfills and a few other parameters, e.g.:
osd max backfills = 4
osd recovery max active = 1
osd recovery threads = 1
I also had to stop and then restart all OSDs to halt the peering storm,
but the data migration has been proceeding fine since then.
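(For anyone following along: these settings can also be injected at
runtime rather than set in ceph.conf. Something like the line below is
what I mean - the values are just what I settled on here, not a general
recommendation:

ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 1'
)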
However I do see transfer errors fetching some files out of radosgw -
the transfer just hangs and then aborts. I'd guess this is probably due
to one pg stuck down because of a lost (failed HDD) osd. I think there
is no alternative but to declare the osd lost, but I wish I understood
better the implications of the "recovery_state" and "past_intervals"
output by ceph pg query: https://pastebin.com/8WrYLwVt
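For reference, what I'm considering is essentially just the following
(pg id as in the pastebin; "osd lost" obviously being a last resort):

ceph pg <pgid> query                       # full peering/recovery state for the stuck pg
ceph osd lost 98 --yes-i-really-mean-it    # declare osd.98 permanently gone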
I find it disturbing/odd that the acting set of osds lists only 3/6 as
available; this implies that without getting one of them back it would
be impossible to recover the data (from 4+2 EC). However, the dead osd
98 only appears in the most recent (?) interval - presumably during the
flapping period, when client writes were unlikely (radosgw was
disabled).
So if 98 were marked lost, would the pg roll back to the prior
interval? I am not certain how to interpret this information!
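For what it's worth, the other things I've been checking are below
(this assumes the pool name matches the old crush rule name quoted
further down):

ceph pg map <pgid>                               # current up and acting sets
ceph osd pool get .rgw.buckets.ec42 min_size     # shards needed for the pg to go active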
Running luminous 12.2.7 if it makes a difference.
Thanks as always for pointers,
Graham
On 07/18/2018 02:35 PM, Graham Allan wrote:
Like many, we have a typical double root crush map, for hdd vs ssd-based
pools. We've been running luminous for some time, so in preparation for a
migration to new storage hardware, I wanted to migrate our pools to use
the new device-class based rules; this way I shouldn't need to
perpetuate the double hdd/ssd crush map for new hardware...
I understand how to migrate our replicated pools, by creating new
replicated crush rules and moving the pools over one at a time (roughly
as sketched below), but I'm confused about how to do this for
erasure-coded pools.
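For the replicated pools I mean something along these lines (the rule
name is just my own choice, and the pool name is a placeholder):

ceph osd crush rule create-replicated replicated_hdd default host hdd
ceph osd pool set <poolname> crush_rule replicated_hdd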
I can create a new class-aware EC profile something like:
ceph osd erasure-code-profile set ecprofile42_hdd k=4 m=2 \
    crush-device-class=hdd crush-failure-domain=host
then a new crush rule from this:
ceph osd crush rule create-erasure ec42_hdd ecprofile42_hdd
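Before pointing the pool at it, I'd probably sanity-check the new rule
against the live crush map, roughly as below (the rule id is whatever
"ceph osd crush rule dump ec42_hdd" reports, 7 in my case):

ceph osd getcrushmap -o crush.bin
crushtool -i crush.bin --test --rule 7 --num-rep 6 --show-mappings | head
crushtool -i crush.bin --test --rule 7 --num-rep 6 --show-bad-mappings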
So mostly I want to confirm that it is safe to change the crush rule
for the EC pool. It seems to make sense, but then, as I understand it,
you can't change the erasure code profile of a pool after creation, and
this seems to implicitly do so...
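Concretely, I think the switch itself would just be the following
(assuming the pool really is named .rgw.buckets.ec42 like the old
rule), with the EC profile recorded on the pool left untouched:

ceph osd pool set .rgw.buckets.ec42 crush_rule ec42_hdd
ceph osd pool get .rgw.buckets.ec42 erasure_code_profile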
old rule:
rule .rgw.buckets.ec42 {
        id 17
        type erasure
        min_size 3
        max_size 20
        step set_chooseleaf_tries 5
        step take platter
        step chooseleaf indep 0 type host
        step emit
}
old ec profile:
# ceph osd erasure-code-profile get ecprofile42
crush-failure-domain=host
directory=/usr/lib/x86_64-linux-gnu/ceph/erasure-code
k=4
m=2
plugin=jerasure
technique=reed_sol_van
new rule:
rule ec42_hdd {
        id 7
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take default class hdd
        step chooseleaf indep 0 type host
        step emit
}
new ec profile:
# ceph osd erasure-code-profile get ecprofile42_hdd
crush-device-class=hdd
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=4
m=2
plugin=jerasure
technique=reed_sol_van
w=8
These are both 4+2 EC, but I'm not sure why the old rule has
"max_size 20" (perhaps because it was generated a long time ago, under
hammer?).
Thanks for any feedback,
Graham
--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx