Re: lost osd while migrating EC pool to device-class crush rules

On 09/14/2018 02:38 PM, Gregory Farnum wrote:
> On Thu, Sep 13, 2018 at 3:05 PM, Graham Allan <gta@xxxxxxx> wrote:

>> However I do see transfer errors fetching some files out of radosgw - the
>> transfer just hangs then aborts. I'd guess this is probably due to one pg
>> stuck down, due to a lost (failed HDD) osd. I think there is no alternative
>> but to declare the osd lost, but I wish I understood better the implications
>> of the "recovery_state" and "past_intervals" output by ceph pg query:
>> https://pastebin.com/8WrYLwVt

> What are you curious about here? The past_intervals output lists the
> OSDs which were involved in the PG since it was last clean, then each
> acting set and the intervals it was active for.

That's pretty much what I was looking for: confirmation that the pg can roll back to an earlier interval if there were no writes, once the current osd has been declared lost.
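
For the record, the sequence I have in mind is roughly the following (the pg
id below is a placeholder; osd.98 is the dead drive):

    # inspect the stuck pg's recovery_state and past_intervals first
    ceph pg <pgid> query

    # then declare the failed osd lost so peering can roll back
    ceph osd lost 98 --yes-i-really-mean-it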

>> I find it disturbing/odd that the acting set of osds lists only 3/6
>> available; this implies that without getting one of these back it would be
>> impossible to recover the data (from 4+2 EC). However the dead osd 98 only
>> appears in the most recent (?) interval - presumably during the flapping
>> period, during which time client writes were unlikely (radosgw disabled).
>>
>> So if 98 were marked lost, would it roll back to the prior interval? I am
>> not certain how to interpret this information!

> Yes, that’s what should happen if it’s all as you outline here.

> It *is* quite curious that the PG apparently went active with only 4
> members in a 4+2 system — it's supposed to require at least k+1 (here,
> 5) by default. Did you override the min_size or something?
> -Greg

Looking back through history, it seems that I *did* override the min_size for this pool; however, I didn't reduce it - it used to have min_size 2! That made no sense to me; I think it must be an artifact of a very early (hammer?) ec pool creation, but it pre-dates me.
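
For anyone else checking their own setup, this is roughly what I looked at
(pool and erasure-code-profile names elided):

    # current min_size and other per-pool parameters
    ceph osd pool get <poolname> min_size
    ceph osd pool ls detail

    # k/m for the profile backing the pool
    ceph osd erasure-code-profile get <profile>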

I found the documentation on what min_size should be a bit confusing, which is how I arrived at 4. I fully agree that k+1 = 5 makes far more sense.

I don't think I was the only one confused by this though, e.g.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-May/026445.html

I suppose the safest thing to do is to update min_size to 5 right away, to force any size-4 pgs down until they can perform recovery. I can set force-recovery on these as well...
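
Presumably something along these lines (pool name and pg ids elided):

    # raise min_size so pgs left with only k shards stop serving i/o
    ceph osd pool set <poolname> min_size 5

    # prioritize recovery of the degraded pgs
    ceph pg force-recovery <pgid> [<pgid>...]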

Is there any setting which can permit these pgs to fulfil reads while refusing writes when active size=k?


--
Graham Allan
Minnesota Supercomputing Institute - gta@xxxxxxx



