On Mon, Sep 17, 2018 at 8:21 AM Graham Allan <gta@xxxxxxx> wrote:
On 09/14/2018 02:38 PM, Gregory Farnum wrote:
> On Thu, Sep 13, 2018 at 3:05 PM, Graham Allan <gta@xxxxxxx> wrote:
>>
>> However I do see transfer errors fetching some files out of radosgw - the
>> transfer just hangs then aborts. I'd guess this is probably due to one pg stuck
>> down because of a lost (failed HDD) osd. I think there is no alternative but to
>> declare the osd lost, but I wish I understood better the implications of the
>> "recovery_state" and "past_intervals" output by ceph pg query:
>> https://pastebin.com/8WrYLwVt
>
> What are you curious about here? The past_intervals section lists the
> OSDs which were involved in the PG since it was last clean, then each
> acting set and the intervals it was active for.
That's pretty much what I was looking for - that, plus confirmation
that the pg can roll back to an earlier interval if there were no
writes, once the current osd is declared lost.
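(For reference, the sort of commands I've been looking at - the pg and osd
ids below are placeholders rather than the actual ones from the pastebin,
and jq is only there for readability:)

    # dump the full peering/recovery history for the stuck pg
    ceph pg <pgid> query | jq '.recovery_state'

    # if it comes to that, declare the dead osd unrecoverable so peering can move on
    ceph osd lost <osdid> --yes-i-really-mean-it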
>> I find it disturbing/odd that the acting set of osds lists only 3/6
>> available, which implies that without getting one of these back it would be
>> impossible to recover the data (from 4+2 EC). However the dead osd 98 only
>> appears in the most recent (?) interval - presumably during the flapping
>> period, during which time client writes were unlikely (radosgw disabled).
>>
>> So if 98 were marked lost would it roll back to the prior interval? I am not
>> certain how to interpret this information!
>
> Yes, that’s what should happen if it’s all as you outline here.
>
> It *is* quite curious that the PG apparently went active with only 4
> members in a 4+2 system — it's supposed to require at least k+1 (here,
> 5) by default. Did you override the min_size or something?
> -Greg
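(For reference, the pool's current min_size and EC profile can be read back
with something like the following - the pool and profile names are placeholders:)

    ceph osd pool get <pool> min_size
    ceph osd pool get <pool> erasure_code_profile
    ceph osd erasure-code-profile get <profile>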
Looking back through the history, it seems that I *did* override the min_size
for this pool; however, I didn't reduce it - it used to have min_size 2!
That made no sense to me - I think it must be an artifact of a very
early (hammer?) ec pool creation, but it pre-dates me.
I found the documentation on what min_size should be a bit confusing,
which is how I arrived at 4. I fully agree that k+1=5 makes way more sense.
I don't think I was the only one confused by this though, eg
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026445.html
I suppose the safest thing to do is update min_size to 5 right away to
force any size-4 pgs down until they can perform recovery. I can set
force-recovery on these as well...
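(Something like this is what I had in mind - the pg ids are placeholders:)

    # list the pgs currently running undersized
    ceph pg ls undersized

    # bump their recovery priority
    ceph pg force-recovery <pgid> [<pgid>...]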
Mmm, this is embarrassing, but that actually doesn't quite work due to https://github.com/ceph/ceph/pull/24095, which has been on my task list (near the bottom) for a while. :( So if your cluster is stable now, I'd let it clean up and then change the min_size once everything is repaired.
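(Once everything is repaired, the eventual change is just the usual pool
setting - the pool name is a placeholder:)

    ceph osd pool set <pool> min_size 5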
Is there any setting which can permit these pgs to fulfil reads while
refusing writes when active size=k?
No, that's unfortunately infeasible.
-Greg