hi,

On 2019-10-22 08:05, Eugen Block wrote:
> Hi,
>
> can you share `ceph osd tree`? What crush rules are in use in your
> cluster? I assume that the two failed OSDs prevent the remapping because
> the rules can't be applied.
>

'ceph osd tree' gives:

ID WEIGHT   TYPE NAME             UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 27.94199 root default
-2  9.31400     host alpha.local
 0  4.65700         osd.0            down        0          1.00000
 3  4.65700         osd.3              up  1.00000          1.00000
-3  9.31400     host beta.local
 1  4.65700         osd.1              up  1.00000          1.00000
 6  4.65700         osd.6            down        0          1.00000
-4  9.31400     host gamma.local
 2  4.65700         osd.2              up  1.00000          1.00000
 4  4.65700         osd.4              up  1.00000          1.00000

The crush rules should be fairly simple, nothing particularly customized
as far as I can tell.

'ceph osd crush tree' gives:

[
    {
        "id": -1,
        "name": "default",
        "type": "root",
        "type_id": 10,
        "items": [
            {
                "id": -2,
                "name": "alpha.local",
                "type": "host",
                "type_id": 1,
                "items": [
                    {
                        "id": 0,
                        "name": "osd.0",
                        "type": "osd",
                        "type_id": 0,
                        "crush_weight": 4.656998,
                        "depth": 2
                    },
                    {
                        "id": 3,
                        "name": "osd.3",
                        "type": "osd",
                        "type_id": 0,
                        "crush_weight": 4.656998,
                        "depth": 2
                    }
                ]
            },
            {
                "id": -3,
                "name": "beta.local",
                "type": "host",
                "type_id": 1,
                "items": [
                    {
                        "id": 1,
                        "name": "osd.1",
                        "type": "osd",
                        "type_id": 0,
                        "crush_weight": 4.656998,
                        "depth": 2
                    },
                    {
                        "id": 6,
                        "name": "osd.6",
                        "type": "osd",
                        "type_id": 0,
                        "crush_weight": 4.656998,
                        "depth": 2
                    }
                ]
            },
            {
                "id": -4,
                "name": "gamma.local",
                "type": "host",
                "type_id": 1,
                "items": [
                    {
                        "id": 2,
                        "name": "osd.2",
                        "type": "osd",
                        "type_id": 0,
                        "crush_weight": 4.656998,
                        "depth": 2
                    },
                    {
                        "id": 4,
                        "name": "osd.4",
                        "type": "osd",
                        "type_id": 0,
                        "crush_weight": 4.656998,
                        "depth": 2
                    }
                ]
            }
        ]
    }
]

and 'ceph osd crush rule dump' gives:

[
    {
        "rule_id": 0,
        "rule_name": "replicated_ruleset",
        "ruleset": 0,
        "type": 1,
        "min_size": 1,
        "max_size": 10,
        "steps": [
            {
                "op": "take",
                "item": -1,
                "item_name": "default"
            },
            {
                "op": "chooseleaf_firstn",
                "num": 0,
                "type": "host"
            },
            {
                "op": "emit"
            }
        ]
    }
]

The cluster actually reached HEALTH_OK after osd.0 went down, but when
osd.6 went down it did not recover. The cluster is running ceph version
10.2.2.

Any help is greatly appreciated!

thanks & cheers
Philipp

> Quoting Philipp Schwaha <philipp@xxxxxxxxxxx>:
>
>> hi,
>>
>> I have a problem with a cluster being stuck in recovery after osd
>> failure. At first recovery was doing quite well, but now it just sits
>> there without any progress. It currently looks like this:
>>
>>      health HEALTH_ERR
>>             36 pgs are stuck inactive for more than 300 seconds
>>             50 pgs backfill_wait
>>             52 pgs degraded
>>             36 pgs down
>>             36 pgs peering
>>             1 pgs recovering
>>             1 pgs recovery_wait
>>             36 pgs stuck inactive
>>             52 pgs stuck unclean
>>             52 pgs undersized
>>             recovery 261632/2235446 objects degraded (11.704%)
>>             recovery 259813/2235446 objects misplaced (11.622%)
>>             recovery 2/1117723 unfound (0.000%)
>>      monmap e3: 3 mons at
>> {0=192.168.19.13:6789/0,1=192.168.19.17:6789/0,2=192.168.19.23:6789/0}
>>             election epoch 78, quorum 0,1,2 0,1,2
>>      osdmap e7430: 6 osds: 4 up, 4 in; 88 remapped pgs
>>             flags sortbitwise
>>       pgmap v20023893: 256 pgs, 1 pools, 4366 GB data, 1091 kobjects
>>             8421 GB used, 10183 GB / 18629 GB avail
>>             261632/2235446 objects degraded (11.704%)
>>             259813/2235446 objects misplaced (11.622%)
>>             2/1117723 unfound (0.000%)
>>                  168 active+clean
>>                   50 active+undersized+degraded+remapped+wait_backfill
>>                   36 down+remapped+peering
>>                    1 active+recovering+undersized+degraded+remapped
>>                    1 active+recovery_wait+undersized+degraded+remapped
>>
>> Is there any way to motivate it to resume recovery?
>>
>> Thanks
>> Philipp
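P.S.: I can run more diagnostics if that helps. This is roughly what I was
planning to look at next; the PG id 0.2a below is only a placeholder, and
'--num-rep 2' assumes the pool really has size 2:

    # summarise which PGs are stuck and why
    ceph health detail
    ceph pg dump_stuck inactive

    # query one of the down+remapped+peering PGs in detail
    # (shows the peering state and which OSDs it is still probing)
    ceph pg 0.2a query

    # list the objects reported as unfound
    ceph pg 0.2a list_missing

    # check whether rule 0 can still map replicas across the remaining hosts
    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -i /tmp/crushmap --test --rule 0 --num-rep 2 --show-mappings | head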