Re: ceph Nautilus lost two disk over night everything hangs

Hi,

On 30.03.21 13:05, Rainer Krienke wrote:
Hello,

yes, your assumptions are correct: pxa-rbd is the metadata pool for pxa-ec, which uses an erasure coding 4+2 profile.

In the last hours ceph repaired most of the damage. One inactive PG remained, and ceph health detail then told me:

---------
HEALTH_WARN Reduced data availability: 1 pg inactive, 1 pg incomplete; 15 daemons have recently crashed; 150 slow ops, oldest one blocked for 26716 sec, daemons [osd.60,osd.67] have slow ops.
PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
    pg 36.15b is remapped+incomplete, acting [60,2147483647,23,96,2147483647,36] (reducing pool pxa-ec min_size from 5 may help; search ceph.com/docs for 'incomplete')


*snipsnap*

2147483647 is (uint32)(-1), which means no associated OSD. So this PG does not have six independent OSDs, and no backfilling is happening since there are no targets to backfill.
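To see which shards are actually missing, you could query the PG directly; a quick check (using the pg id 36.15b from your health output) might look like:

    # show the up and acting sets for this PG; 2147483647 again marks
    # positions without an assigned OSD
    ceph pg map 36.15b
    # detailed peering state, including which OSDs ceph would probe
    ceph pg 36.15b query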


You mentioned 9 hosts, so if you use a simple host-based crush rule, ceph should be able to find new OSDs for that PG. If you do not use standard crush rules, please check that ceph is able to derive enough OSDs to satisfy the PG requirements (six different OSDs).
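If you are unsure whether your rule can still place six shards, you could pull the crush map and test the rule offline; a rough sketch (the rule id 1 below is just a placeholder, take the real id from the rule dump) would be:

    # which crush rule the EC pool uses, and the numeric ids of all rules
    ceph osd pool get pxa-ec crush_rule
    ceph osd crush rule dump
    # grab the compiled crush map from the cluster
    ceph osd getcrushmap -o /tmp/crushmap
    # check whether the rule can map 6 OSDs; bad mappings show unfilled slots
    crushtool -i /tmp/crushmap --test --rule 1 --num-rep 6 --show-bad-mappings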


The 'incomplete' part might be a problem. If just a chunk were missing, the state should be undersized, not incomplete...
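As the health message itself suggests, temporarily lowering min_size below k+1 may let the incomplete PG peer again with only four chunks. If you try that, treat it as a temporary measure and revert it once recovery has finished, e.g.:

    # current value (5 = k+1 for a 4+2 profile)
    ceph osd pool get pxa-ec min_size
    # allow peering/I/O with only k=4 chunks present (temporary!)
    ceph osd pool set pxa-ec min_size 4
    # after the PG has recovered, restore the safer default
    ceph osd pool set pxa-ec min_size 5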


Regards,

Burkhard

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



