Wild guess: you hit the PG hard limit. How many PGs per OSD do you have?
If that is the case, increase "osd max pg per osd hard ratio".
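Roughly, and assuming a release with the centralized config store (Mimic or
later), checking and bumping it could look like this; the value 5 is only an
example:

  ceph osd df                                          # PGS column = PGs per OSD
  ceph config set osd osd_max_pg_per_osd_hard_ratio 5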
Check "ceph pg <pgid> query" to see why it isn't activating.
Can you share the output of "ceph osd df tree" and "ceph pg <pgid> query" for the affected PGs?
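For the two PGs from the status output below that would be:

  ceph osd df tree
  ceph pg 21.2e4 query
  ceph pg 23.5 query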
Paul
--
Paul Emmerich
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
On Wed, Jun 19, 2019 at 8:52 AM Lars Täuber <taeuber@xxxxxxx> wrote:
Hi there!
Recently I made our cluster rack aware
by adding racks to the crush map.
The failure domain was and still is "host".
rule cephfs2_data {
        id 7
        type erasure
        min_size 3
        max_size 6
        step set_chooseleaf_tries 5
        step set_choose_tries 100
        step take PRZ
        step chooseleaf indep 0 type host
        step emit
}
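(The rule above is in decompiled-crushmap format; it can also be inspected
directly, with the rule/pool names as used here:

  ceph osd crush rule dump cephfs2_data
  ceph osd pool get cephfs_data crush_rule
)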
Then I sorted the hosts into the new
rack buckets of the crush map to match
their physical locations, by running

# ceph osd crush move onodeX rack=XYZ

for all hosts.
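(As a sketch only, with made-up host and rack names:

  for h in onode1 onode2 onode3; do
      ceph osd crush move "$h" rack=rack1
  done
  ceph osd tree    # verify every host now sits under its rack bucket
)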
The cluster started to reorder the data.
In the end, the cluster now reports:
HEALTH_WARN 1 filesystem is degraded; Reduced data availability: 2 pgs inactive; Degraded data redundancy: 678/2371785 objects degraded (0.029%), 2 pgs degraded, 2 pgs undersized
FS_DEGRADED 1 filesystem is degraded
fs cephfs_1 is degraded
PG_AVAILABILITY Reduced data availability: 2 pgs inactive
pg 21.2e4 is stuck inactive for 142792.952697, current state activating+undersized+degraded+remapped+forced_backfill, last acting [5,2147483647,25,28,11,2]
pg 23.5 is stuck inactive for 142791.437243, current state activating+undersized+degraded+remapped+forced_backfill, last acting [13,21]
PG_DEGRADED Degraded data redundancy: 678/2371785 objects degraded (0.029%), 2 pgs degraded, 2 pgs undersized
pg 21.2e4 is stuck undersized for 142779.321192, current state activating+undersized+degraded+remapped+forced_backfill, last acting [5,2147483647,25,28,11,2]
pg 23.5 is stuck undersized for 142789.747915, current state activating+undersized+degraded+remapped+forced_backfill, last acting [13,21]
The cluster hosts a cephfs which is
not mountable anymore.
I tried a few things (as you can see:
forced_backfill), but failed.
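(The forced backfill was something along the lines of

  ceph pg force-backfill 21.2e4 23.5

if I recall the exact invocation correctly.)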
The cephfs_data pool is EC 4+2.
Both inactive PGs seem to have enough
shards left to reconstruct the contents
for all OSDs.
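(If it helps, the EC profile can be double-checked like this; the second
command takes whatever profile name the first one prints:

  ceph osd pool get cephfs_data erasure_code_profile
  ceph osd erasure-code-profile get <profile-name>
)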
Is there a chance to get both pgs
clean again?
How can I force the pgs to recalculate
all necessary copies?
Thanks
Lars
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com