Hi,

did you already do something (replacing drives or changing something)?

You have 11 scrub errors and ~11 inconsistent pgs.

The inconsistent pgs, for example:

pg 4.3a7 is stuck unclean for 629.766502, current state
active+recovery_wait+degraded+inconsistent, last acting [10,21]

are not on the down osds 1 and 22, neither of them, so they should not
be missing. But they are.

Anyway, I think the next step would be to run a pg repair on them and
see where the road goes.
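Roughly like this, using pg 4.3a7 from your health detail as an example
(repeat for each inconsistent pg; the list-inconsistent-obj helper
should be available on 10.2.2):

  # list the inconsistent pgs
  ceph health detail | grep inconsistent

  # see what exactly differs between the copies of one pg
  rados list-inconsistent-obj 4.3a7 --format=json-pretty

  # tell the primary osd to repair that pg
  ceph pg repair 4.3a7

As far as I know, repair will mostly take the primary copy as the
authoritative one, and with size=2 there is not much else to compare
against, so have a look at the inconsistent objects first before
letting it overwrite anything.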
--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:info@xxxxxxxxxxxxxxxxx

Anschrift:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 beim Amtsgericht Hanau
Geschäftsführung: Oliver Dzombic

Steuer Nr.: 35 236 3622 1
UST ID: DE274086107


On 03.07.2016 at 23:59, Matyas Koszik wrote:
>
> Hi,
>
> I've continued restarting osds in the meantime, and it got somewhat
> better, but still very far from optimal.
>
> Here are the details you requested:
>
> http://pastebin.com/Vqgadz24
>
> http://pastebin.com/vCL6BRvC
>
> Matyas
>
>
> On Sun, 3 Jul 2016, Oliver Dzombic wrote:
>
>> Hi,
>>
>> please provide:
>>
>> ceph health detail
>>
>> ceph osd tree
>>
>> On 03.07.2016 at 21:36, Matyas Koszik wrote:
>>>
>>> Hi,
>>>
>>> I recently upgraded to jewel (10.2.2) and now I'm confronted with a rather
>>> strange behavior: recovery does not progress in the way it should. If I
>>> restart the osds on a host, it'll get a bit better (or worse), like this:
>>>
>>> 50 pgs undersized
>>> recovery 43775/7057285 objects degraded (0.620%)
>>> recovery 87980/7057285 objects misplaced (1.247%)
>>>
>>> [restart osds on node1]
>>>
>>> 44 pgs undersized
>>> recovery 39623/7061519 objects degraded (0.561%)
>>> recovery 92142/7061519 objects misplaced (1.305%)
>>>
>>> [restart osds on node1]
>>>
>>> 43 pgs undersized
>>> 1116 requests are blocked > 32 sec
>>> recovery 38181/7061529 objects degraded (0.541%)
>>> recovery 90617/7061529 objects misplaced (1.283%)
>>>
>>> ...
>>>
>>> The current state is this:
>>>
>>> osdmap e38804: 53 osds: 51 up, 51 in; 66 remapped pgs
>>>  pgmap v14797137: 4388 pgs, 8 pools, 13626 GB data, 3434 kobjects
>>>        27474 GB used, 22856 GB / 50330 GB avail
>>>        38172/7061565 objects degraded (0.541%)
>>>        90617/7061565 objects misplaced (1.283%)
>>>        8/3517300 unfound (0.000%)
>>>            4202 active+clean
>>>             109 active+recovery_wait+degraded
>>>              38 active+undersized+degraded+remapped+wait_backfill
>>>              15 active+remapped+wait_backfill
>>>              11 active+clean+inconsistent
>>>               8 active+recovery_wait+degraded+remapped
>>>               3 active+recovering+undersized+degraded+remapped
>>>               2 active+recovery_wait+undersized+degraded+remapped
>>>
>>>
>>> All the pools have size=2 min_size=1.
>>>
>>> (All the unfound blocks are on undersized pgs, and I cannot seem to be
>>> able to fix them without having replicas (?). They exist, but are
>>> outdated, from an earlier problem.)
>>>
>>> Matyas
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com