Quoting Brent Kennedy (bkennedy@xxxxxxxxxx):
> Unfortunately, this cluster was set up before the calculator was in
> place and before the equation was well understood. We have the
> storage space to move the pools and recreate them, which was
> apparently the only way to handle the issue (you are suggesting what
> appears to be a different approach). I was hoping to avoid doing all
> of this because the migration would be very time consuming. There is
> no way to fix the stuck PGs though? If I were to expand the
> replication to 3 instances, would that help with the PGs-per-OSD
> issue any?

No! It will make the problem worse, because each replica needs its own
copy of every PG: the more replicas, the more PG copies each OSD has
to host.

> The math was originally based on 3, not the current 2. Sounds
> like it may change to 300 max, which may not be helpful...
> When you say enforce, do you mean it will block all access to the
> cluster/OSDs?

No, it means you will not be able to increase the number of PGs on the
pool.

> We have upgraded from Hammer to Jewel and then Luminous 12.2.2 as of
> today. During the Hammer-to-Jewel upgrade we lost two host servers
> and let the cluster rebalance/recover; it ran out of space and
> stalled. We then added three new host servers and let the cluster
> rebalance/recover. During that process, at some point, we ended up
> with 4 PGs that could not be repaired using "ceph pg repair xx.xx".
> I tried using "ceph pg 11.720 query" and from what I can tell the
> missing information matches, but the PG is being blocked from being
> marked clean. I keep seeing references to ceph-objectstore-tool as
> an export/restore method, but I cannot find a step-by-step process
> for our current predicament. It may also be acceptable for us to
> simply lose the data, if it can't be extracted, so we can at least
> return the cluster to a healthy state. Any thoughts?
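The replica arithmetic above can be sketched numerically. This is an
illustrative sketch with made-up cluster numbers (4096 PGs, 20 OSDs are
hypothetical, not taken from the thread), not Ceph code:

```python
# Rough PG-per-OSD arithmetic: each replica is a full copy of every PG,
# so raising a pool's size from 2 to 3 raises the average PG-copy count
# per OSD by 50%, which worsens a too-many-PGs-per-OSD problem.

def pgs_per_osd(pg_num: int, size: int, num_osds: int) -> float:
    """Average number of PG copies each OSD ends up hosting."""
    return pg_num * size / num_osds

# Hypothetical numbers: a pool with 4096 PGs on a 20-OSD cluster.
print(pgs_per_osd(4096, 2, 20))  # size=2 -> 409.6 copies per OSD
print(pgs_per_osd(4096, 3, 20))  # size=3 -> 614.4, even further past the limit
```

The point is that replication size multiplies the per-OSD load directly,
so going from 2 to 3 replicas cannot relieve a PGs-per-OSD warning.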
What is the output of:

  ceph daemon osd.$ID config show | grep osd_allow_recovery_below_min_size

If you are below min_size, recovery will not complete while that
setting is false. Maybe this thread is interesting:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-October/005613.html

Especially the case where an OSD is a candidate backfill target but
does not yet contain any data.

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info@xxxxxx

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
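The min_size gate Stefan describes can be sketched as a simple
predicate. This is illustrative logic only (the function name and
argument layout are my own, not Ceph source):

```python
# Sketch of the recovery gate: with osd_allow_recovery_below_min_size
# set to false, a PG whose acting set holds fewer replicas than the
# pool's min_size cannot make recovery progress.

def can_recover(acting_replicas: int, min_size: int,
                allow_below_min_size: bool) -> bool:
    """True if recovery is allowed to proceed for this PG."""
    return acting_replicas >= min_size or allow_below_min_size

# A pool with min_size=2 that is down to a single surviving replica:
print(can_recover(1, 2, allow_below_min_size=False))  # False: recovery stalls
print(can_recover(1, 2, allow_below_min_size=True))   # True: recovery may proceed
```

This is why checking the current value of that option on the affected
OSDs is a useful first diagnostic step before reaching for
ceph-objectstore-tool.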