Does anyone have an idea how to solve this situation? Thanks for any advice.

Kind Regards
Harald Rößler

> On 23.10.2014 at 18:56, Harald Rößler <Harald.Roessler@xxxxxx> wrote:
>
> @Wido: sorry, I don't understand 100% what you mean; I have generated some output which may help.
>
>
> OK, the pool:
>
> pool 3 'bcf' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 832 pgp_num 832 last_change 8000 owner 0
>
>
> all remapped PGs have a pg_temp entry:
>
> pg_temp 3.1 [14,20,0]
> pg_temp 3.c [1,7,23]
> pg_temp 3.22 [15,21,23]
>
>
> 3.22 429 0 2 0 1654296576 0 0 active+remapped 2014-10-23 03:25:03.180505 8608'363836897 8608'377970131 [15,21] [15,21,23] 3578'354650024 2014-10-16 04:06:39.133104 3578'354650024 2014-10-16 04:06:39.133104
>
> the crush rules:
>
> # rules
> rule data {
>         ruleset 0
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
> rule metadata {
>         ruleset 1
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
> rule rbd {
>         ruleset 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
>
>
> ceph pg 3.22 query
>
> { "state": "active+remapped",
>   "epoch": 8608,
>   "up": [
>         15,
>         21],
>   "acting": [
>         15,
>         21,
>         23],
>   "info": { "pgid": "3.22",
>       "last_update": "8608'363845313",
>       "last_complete": "8608'363845313",
>       "log_tail": "8608'363842312",
>       "last_backfill": "MAX",
>       "purged_snaps": "[1~1,3~3,8~6,f~31,42~1,44~3,48~f,58~1,5a~2]",
>       "history": { "epoch_created": 140,
>           "last_epoch_started": 8576,
>           "last_epoch_clean": 8576,
>           "last_epoch_split": 0,
>           "same_up_since": 8340,
>           "same_interval_since": 8575,
>           "same_primary_since": 7446,
>           "last_scrub": "3578'354650024",
>           "last_scrub_stamp": "2014-10-16 04:06:39.133104",
>           "last_deep_scrub": "3578'354650024",
>           "last_deep_scrub_stamp": "2014-10-16 04:06:39.133104",
>           "last_clean_scrub_stamp": "2014-10-16 04:06:39.133104"},
>       "stats": { "version": "8608'363845313",
>           "reported": "8608'377978685",
>           "state": "active+remapped",
>           "last_fresh": "2014-10-23 18:55:07.582844",
>           "last_change": "2014-10-23 03:25:03.180505",
>           "last_active": "2014-10-23 18:55:07.582844",
>           "last_clean": "2014-10-20 07:51:21.330669",
>           "last_became_active": "2013-07-14 07:20:30.173508",
>           "last_unstale": "2014-10-23 18:55:07.582844",
>           "mapping_epoch": 8370,
>           "log_start": "8608'363842312",
>           "ondisk_log_start": "8608'363842312",
>           "created": 140,
>           "last_epoch_clean": 8576,
>           "parent": "0.0",
>           "parent_split_bits": 0,
>           "last_scrub": "3578'354650024",
>           "last_scrub_stamp": "2014-10-16 04:06:39.133104",
>           "last_deep_scrub": "3578'354650024",
>           "last_deep_scrub_stamp": "2014-10-16 04:06:39.133104",
>           "last_clean_scrub_stamp": "2014-10-16 04:06:39.133104",
>           "log_size": 0,
>           "ondisk_log_size": 0,
>           "stats_invalid": "0",
>           "stat_sum": { "num_bytes": 1654296576,
>               "num_objects": 429,
>               "num_object_clones": 28,
>               "num_object_copies": 0,
>               "num_objects_missing_on_primary": 0,
>               "num_objects_degraded": 0,
>               "num_objects_unfound": 0,
>               "num_read": 8053865,
>               "num_read_kb": 124022900,
>               "num_write": 363844886,
>               "num_write_kb": 2083536824,
>               "num_scrub_errors": 0,
>               "num_shallow_scrub_errors": 0,
>               "num_deep_scrub_errors": 0,
>               "num_objects_recovered": 2777,
>               "num_bytes_recovered": 11138282496,
>               "num_keys_recovered": 0},
>           "stat_cat_sum": {},
>           "up": [
>                 15,
>                 21],
>           "acting": [
>                 15,
>                 21,
>                 23]},
>       "empty": 0,
>       "dne": 0,
>       "incomplete": 0,
>       "last_epoch_started": 8576},
>   "recovery_state": [
>         { "name": "Started\/Primary\/Active",
>           "enter_time": "2014-10-23 03:25:03.179759",
>           "might_have_unfound": [],
>           "recovery_progress": { "backfill_target": -1,
>               "waiting_on_backfill": 0,
>               "backfill_pos": "0\/\/0\/\/-1",
>               "backfill_info": { "begin": "0\/\/0\/\/-1",
>                   "end": "0\/\/0\/\/-1",
>                   "objects": []},
>               "peer_backfill_info": { "begin": "0\/\/0\/\/-1",
>                   "end": "0\/\/0\/\/-1",
>                   "objects": []},
>               "backfills_in_flight": [],
>               "pull_from_peer": [],
>               "pushing": []},
>           "scrub": { "scrubber.epoch_start": "0",
>               "scrubber.active": 0,
>               "scrubber.block_writes": 0,
>               "scrubber.finalizing": 0,
>               "scrubber.waiting_on": 0,
>               "scrubber.waiting_on_whom": []}},
>         { "name": "Started",
>           "enter_time": "2014-10-23 03:25:02.174216"}]}
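One way to sanity-check whether ruleset 0 above can actually place three replicas on three distinct hosts is to test the CRUSH map offline; a rough sketch (the /tmp paths are only examples):

    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -d /tmp/crushmap -o /tmp/crushmap.txt    # decompile to inspect hosts and weights
    crushtool -i /tmp/crushmap --test --rule 0 --num-rep 3 --show-bad-mappings
    ceph osd tree                                      # how many hosts, and how are the weights spread?

If --show-bad-mappings reports mappings with fewer than 3 OSDs for rule 0, CRUSH simply cannot find a third host for some PGs (too few hosts, or very uneven host weights after the reweighting), which would match an up set of [15,21] next to a pg_temp acting set of [15,21,23].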
>
>
>> On 23.10.2014 at 17:36, Wido den Hollander <wido@xxxxxxxx> wrote:
>>
>> On 10/23/2014 05:33 PM, Harald Rößler wrote:
>>> Hi all
>>>
>>> the procedure does not work for me, I still have 47 active+remapped PGs. Does anyone have an idea how to fix this issue?
>>
>> If you look at those PGs using "ceph pg dump", what is their prefix?
>>
>> They should start with a number and that number corresponds back to a
>> pool ID which you can see with "ceph osd dump|grep pool"
>>
>> Could it be that that specific pool is using a special crush rule?
>>
>> Wido
>>
>>> @Wido: now my cluster has a usage of less than 80% - thanks for your advice.
>>>
>>> Harry
>>>
>>>
>>> On 21.10.2014 at 22:38, Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> In that case, take a look at ceph pg dump | grep remapped. In the up or acting column, there should be one or two common OSDs between the stuck PGs.
>>>
>>> Try restarting those OSD daemons. I've had a few OSDs get stuck scheduling recovery, particularly around toofull situations.
>>>
>>> I've also had Robert's experience of stuck operations becoming unstuck overnight.
>>>
>>>
>>> On Tue, Oct 21, 2014 at 12:02 PM, Harald Rößler <Harald.Roessler@xxxxxx> wrote:
>>> After more than 10 hours it is still the same situation; I don't think it will fix itself over time. How can I find out what the problem is?
>>>
>>>
>>> On 21.10.2014 at 17:28, Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> That will fix itself over time. remapped just means that Ceph is moving the data around. It's normal to see PGs in the remapped and/or backfilling state after OSD restarts.
>>>
>>> They should go down steadily over time. How long depends on how much data is in the PGs, how fast your hardware is, how many OSDs are affected, and how much you allow recovery to impact cluster performance. Mine currently take about 20 minutes per PG. If all 47 are on the same OSD, it'll be a while. If they're evenly split between multiple OSDs, parallelism will speed that up.
>>>
>>> On Tue, Oct 21, 2014 at 1:22 AM, Harald Rößler <Harald.Roessler@xxxxxx> wrote:
>>> Hi all,
>>>
>>> thank you for your support, now the file system is not degraded any more. Now I have a negative degraded count :-)
>>>
>>> 2014-10-21 10:15:22.303139 mon.0 [INF] pgmap v43376478: 3328 pgs: 3281 active+clean, 47 active+remapped; 1609 GB data, 5022 GB used, 1155 GB / 6178 GB avail; 8034B/s rd, 3548KB/s wr, 161op/s; -1638/1329293 degraded (-0.123%)
>>>
>>> but ceph reports a health warning: HEALTH_WARN 47 pgs stuck unclean; recovery -1638/1329293 degraded (-0.123%)
>>>
>>> I think this warning is reported because there are 47 active+remapped PGs; any ideas how to fix that now?
>>>
>>> Kind Regards
>>> Harald Roessler
>>>
>>>
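Following Craig's "ceph pg dump | grep remapped" suggestion, a quick way to see which OSDs the stuck PGs have in common is to count how often each OSD id appears in their acting sets. A sketch; the awk field number matches the dump format pasted above (the acting set is the 15th field there), so double-check it against the header of your own dump first:

    ceph pg dump_stuck unclean
    ceph pg dump | grep remapped | awk '{print $15}' | tr -d '[]' | tr ',' '\n' | sort -n | uniq -c | sort -rn

The OSD ids at the top of that list are the ones whose daemons are worth restarting first.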
>>> On 21.10.2014 at 01:03, Craig Lewis <clewis@xxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> I've been in a state where reweight-by-utilization was deadlocked (not the daemons, but the remap scheduling). After successive osd reweight commands, two OSDs wanted to swap PGs, but they were both toofull. I ended up temporarily increasing mon_osd_nearfull_ratio to 0.87. That removed the impediment, and everything finished remapping. Everything went smoothly, and I changed it back when all the remapping finished.
>>>
>>> Just be careful if you need to get close to mon_osd_full_ratio. Ceph does greater-than on these percentages, not greater-than-equal. You really don't want the disks to get greater-than mon_osd_full_ratio, because all external IO will stop until you resolve that.
>>>
>>>
>>> On Mon, Oct 20, 2014 at 10:18 AM, Leszek Master <keksior@xxxxxxxxx> wrote:
>>> You can set a lower weight on the full OSDs, or try changing the osd_near_full_ratio parameter in your cluster from 85 to, for example, 89. But I don't know what can go wrong when you do that.
>>>
>>>
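For reference, the thresholds Craig and Leszek mention can usually be raised at runtime; a sketch only, since the exact commands and option names differ between releases (newer releases use "ceph osd set-nearfull-ratio" instead), so verify against the documentation for your version first:

    # cluster-wide nearfull threshold (Craig's example used 0.87); on old releases this lives in the PG map
    ceph pg set_nearfull_ratio 0.87
    # backfill_toofull is governed by a separate per-OSD option
    ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.87'

As Craig warns, keep both values well below the full ratio, otherwise client IO stops entirely once a disk crosses it.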
>>> 2014-10-20 17:12 GMT+02:00 Wido den Hollander <wido@xxxxxxxx>:
>>> On 10/20/2014 05:10 PM, Harald Rößler wrote:
>>>> yes, tomorrow I will get the replacement for the failed disk; getting a new node with many disks will take a few days.
>>>> No other idea?
>>>>
>>>
>>> If the disks are all full, then, no.
>>>
>>> Sorry to say this, but it came down to poor capacity management. Never
>>> let any disk in your cluster fill over 80% to prevent these situations.
>>>
>>> Wido
>>>
>>>> Harald Rößler
>>>>
>>>>
>>>>> On 20.10.2014 at 16:45, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>
>>>>> On 10/20/2014 04:43 PM, Harald Rößler wrote:
>>>>>> Yes, I had some OSDs which were near full; I tried to fix the problem with "ceph osd reweight-by-utilization", but this did not help. After that I set the near full ratio to 88% with the idea that the remapping would fix the issue. A restart of the OSDs doesn't help either. At the same time I had a hardware failure of one disk. :-( After that failure the recovery process started at "degraded ~ 13%" and stops at 7%.
>>>>>> Honestly, I am scared that I am doing the wrong operation at the moment.
>>>>>>
>>>>>
>>>>> Any chance of adding a new node with some fresh disks? Seems like you
>>>>> are operating on the storage capacity limit of the nodes and that your
>>>>> only remedy would be adding more spindles.
>>>>>
>>>>> Wido
>>>>>
>>>>>> Regards
>>>>>> Harald Rößler
>>>>>>
>>>>>>
>>>>>>> On 20.10.2014 at 14:51, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>>>>>
>>>>>>> On 10/20/2014 02:45 PM, Harald Rößler wrote:
>>>>>>>> Dear All
>>>>>>>>
>>>>>>>> I have an issue with my cluster at the moment: the recovery process stops.
>>>>>>>>
>>>>>>>
>>>>>>> See this: 2 active+degraded+remapped+backfill_toofull
>>>>>>>
>>>>>>> 156 pgs backfill_toofull
>>>>>>>
>>>>>>> You have one or more OSDs which are too full and that causes recovery to
>>>>>>> stop.
>>>>>>>
>>>>>>> If you add more capacity to the cluster recovery will continue and finish.
>>>>>>>
>>>>>>>> ceph -s
>>>>>>>>   health HEALTH_WARN 188 pgs backfill; 156 pgs backfill_toofull; 4 pgs backfilling; 55 pgs degraded; 49 pgs recovery_wait; 297 pgs stuck unclean; recovery 111487/1488290 degraded (7.491%)
>>>>>>>>   monmap e2: 3 mons at {0=10.99.10.10:6789/0,12=10.99.10.22:6789/0,6=10.99.10.16:6789/0}, election epoch 332, quorum 0,1,2 0,12,6
>>>>>>>>   osdmap e6748: 24 osds: 23 up, 23 in
>>>>>>>>   pgmap v43314672: 3328 pgs: 3031 active+clean, 43 active+remapped+wait_backfill, 3 active+degraded+wait_backfill, 96 active+remapped+wait_backfill+backfill_toofull, 31 active+recovery_wait, 19 active+degraded+wait_backfill+backfill_toofull, 36 active+remapped, 3 active+remapped+backfilling, 18 active+remapped+backfill_toofull, 6 active+degraded+remapped+wait_backfill, 15 active+recovery_wait+remapped, 21 active+degraded+remapped+wait_backfill+backfill_toofull, 1 active+recovery_wait+degraded, 1 active+degraded+remapped+backfilling, 2 active+degraded+remapped+backfill_toofull, 2 active+recovery_wait+degraded+remapped; 1698 GB data, 5206 GB used, 971 GB / 6178 GB avail; 24382B/s rd, 12411KB/s wr, 320op/s; 111487/1488290 degraded (7.491%)
>>>>>>>>
>>>>>>>>
>>>>>>>> I have tried to restart all OSDs in the cluster, but that does not help to finish the recovery of the cluster.
>>>>>>>>
>>>>>>>> Does someone have any idea?
>>>>>>>>
>>>>>>>> Kind Regards
>>>>>>>> Harald Rößler
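Since several of the states quoted above are backfill_toofull, it helps to know exactly which OSDs are over the threshold before deciding between reweighting and adding capacity. A sketch, assuming the default OSD mount points:

    ceph health detail | grep -i full     # lists the nearfull/full OSDs
    df -h /var/lib/ceph/osd/*             # per-OSD utilization, if the default paths are used
    ceph osd reweight 7 0.9               # osd.7 is only an example id; this lowers just that OSD's share

Keep Craig's earlier warning in mind: reweighting moves data, and on a cluster this full the moves themselves can end up backfill_toofull, so small steps (0.95, then 0.9, ...) are safer than one big jump.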
>>
>>
>> --
>> Wido den Hollander
>> 42on B.V.
>> Ceph trainer and consultant
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com