Re: recovery process stops

Wido den Hollander <wido@xxxxxxxx> · Mon, 20 Oct 2014 16:45:16 +0200

On 10/20/2014 04:43 PM, Harald Rößler wrote:
> Yes, I had some OSDs which were near full. I tried to fix the problem with "ceph osd reweight-by-utilization", but this did not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. Restarting the OSDs did not help either. At the same time I had a hardware failure of one disk. :-( After that failure the recovery process started at "degraded ~ 13%" and stopped at 7%.
> Honestly, at the moment I am scared of doing the wrong operation.
> 

Any chance of adding a new node with some fresh disks? Seems like you
are operating on the storage capacity limit of the nodes and that your
only remedy would be adding more spindles.
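
For reference, the capacity figures in the quoted `ceph -s` output further down give a rough idea of how close to the limit the cluster is. This is only a sketch: the numbers are copied from the pgmap line, the helper names are made up for illustration, and the 0.85 backfill threshold is an assumption (I believe it was the default `osd backfill full ratio` at the time), so check the value configured on your own cluster.

```python
# Capacity figures copied from the pgmap line of the quoted `ceph -s` output.
USED_GB = 5206
TOTAL_GB = 6178
OSDS_IN = 23

# Assumption: backfill threshold of 0.85; verify against your cluster's config.
ASSUMED_BACKFILL_FULL_RATIO = 0.85


def average_utilization(used_gb, total_gb):
    """Return the cluster-wide fraction of raw capacity in use."""
    return used_gb / total_gb


def headroom_gb(used_gb, total_gb, ratio):
    """Return how many GB can still be written before the average hits `ratio`."""
    return total_gb * ratio - used_gb


if __name__ == "__main__":
    util = average_utilization(USED_GB, TOTAL_GB)
    room = headroom_gb(USED_GB, TOTAL_GB, ASSUMED_BACKFILL_FULL_RATIO)
    print(f"average utilization: {util:.1%}")
    print(f"headroom below assumed threshold: {room:.0f} GB "
          f"(about {room / OSDS_IN:.1f} GB per OSD)")
```

The average comes out at roughly 84%, so under the assumed threshold any OSD that is even slightly fuller than average will refuse backfill. Data is never perfectly balanced across OSDs, which is why reweighting alone has so little room to work with here.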

Wido

> Regards
> Harald Rößler	
>  
> 
> 
>> On 20.10.2014 at 14:51, Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>> On 10/20/2014 02:45 PM, Harald Rößler wrote:
>>> Dear All
>>>
>>> At the moment I have an issue with my cluster: the recovery process stops.
>>>
>>
>> See this: 2 active+degraded+remapped+backfill_toofull
>>
>> 156 pgs backfill_toofull
>>
>> You have one or more OSDs which are too full, and that causes recovery to
>> stop.
>>
>> If you add more capacity to the cluster recovery will continue and finish.
>>
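
To see how many PGs are blocked this way, the per-state counts in the pgmap line can be tallied. The sketch below hard-codes the state list from the quoted `ceph -s` output and sums every state that carries a given flag; the function names are hypothetical helpers, not part of any Ceph tooling.

```python
# PG state summary copied from the pgmap line of the quoted `ceph -s` output.
PG_STATES = (
    "3031 active+clean, 43 active+remapped+wait_backfill, "
    "3 active+degraded+wait_backfill, "
    "96 active+remapped+wait_backfill+backfill_toofull, "
    "31 active+recovery_wait, "
    "19 active+degraded+wait_backfill+backfill_toofull, "
    "36 active+remapped, 3 active+remapped+backfilling, "
    "18 active+remapped+backfill_toofull, "
    "6 active+degraded+remapped+wait_backfill, "
    "15 active+recovery_wait+remapped, "
    "21 active+degraded+remapped+wait_backfill+backfill_toofull, "
    "1 active+recovery_wait+degraded, "
    "1 active+degraded+remapped+backfilling, "
    "2 active+degraded+remapped+backfill_toofull, "
    "2 active+recovery_wait+degraded+remapped"
)


def parse_pg_states(summary):
    """Turn '3031 active+clean, 43 ...' into {'active+clean': 3031, ...}."""
    counts = {}
    for entry in summary.split(","):
        count, state = entry.strip().split(" ", 1)
        counts[state] = int(count)
    return counts


def count_with_flag(counts, flag):
    """Sum the PGs whose combined state includes `flag` as one of its parts."""
    return sum(n for state, n in counts.items() if flag in state.split("+"))


if __name__ == "__main__":
    counts = parse_pg_states(PG_STATES)
    print("total PGs:", sum(counts.values()))
    print("backfill_toofull:", count_with_flag(counts, "backfill_toofull"))
    print("degraded:", count_with_flag(counts, "degraded"))
```

For the quoted output this gives 156 PGs with `backfill_toofull` out of 3328, which matches the "156 pgs backfill_toofull" in the health line.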
>>> ceph -s
>>>   health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%)
>>>   monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6
>>>   osdmap e6748: 24 osds: 23 up, 23 in
>>>    pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%)
>>>
>>>
>>> I have tried restarting all OSDs in the cluster, but that does not help the recovery finish.
>>>
>>> Does anyone have an idea?
>>>
>>> Kind Regards
>>> Harald Rößler	
>>>
>>>
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@xxxxxxxxxxxxxx
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> -- 
>> Wido den Hollander
>> Ceph consultant and trainer
>> 42on B.V.
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
> 
