Please correct me if I'm wrong. The recovery delay (osd_recovery_delay_start)
was originally supposed to kick in only during peering, when recovery_wq is
enqueued with the placement groups scheduled for recovery:

RecoveryWQ::_enqueue(...)
    ...
    osd->defer_recovery_until = ceph_clock_now(g_ceph_context);
    osd->defer_recovery_until += g_conf->osd_recovery_delay_start;
    ...

But in fact recovery_wq is enqueued every time any PG passes from the
'recovery_wait' to the 'recovering' state:

PG::RecoveryState::Recovering::Recovering(my_context ctx)
  : my_base(ctx)
{
  state_name = "Started/Primary/Active/Recovering";
  context< RecoveryMachine >().log_enter(state_name);
  PG *pg = context< RecoveryMachine >().pg;
  pg->state_clear(PG_STATE_RECOVERY_WAIT);
  pg->state_set(PG_STATE_RECOVERING);
  pg->osd->queue_for_recovery(pg);
}

This means that the recovery of almost every PG is followed by a break of
'osd_recovery_delay_start' seconds. So it would be nice to correct either the
behaviour or the name/description of this config option, because it is
currently quite misleading: instead of the expected one-time initial delay,
it slows down the entire recovery process.
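To illustrate the cumulative effect, here is a small standalone sketch. It is
not Ceph code, just a toy model with made-up names and values, built on the
assumption (which the snippets above suggest) that every queue_for_recovery()
resets defer_recovery_until to now + osd_recovery_delay_start and that the
recovery worker does nothing before that deadline. With PGs moving through
'recovery_wait' -> 'recovering' one after another, the total added wait grows
roughly as (number of PGs) * osd_recovery_delay_start rather than the single
initial delay one would expect:

#include <chrono>
#include <iostream>
#include <queue>
#include <thread>

int main() {
  using clock = std::chrono::steady_clock;

  // Stand-ins for osd_recovery_delay_start and the number of PGs waiting
  // in recovery_wait; both values are made up for the demo.
  const auto delay = std::chrono::milliseconds(100);
  const int num_pgs = 5;

  std::queue<int> recovery_wq;
  clock::time_point defer_recovery_until = clock::now();

  // Model of queue_for_recovery()/_enqueue(): every call pushes the
  // deferral deadline forward to now + delay, as in the snippet above.
  auto queue_for_recovery = [&](int pg) {
    recovery_wq.push(pg);
    defer_recovery_until = clock::now() + delay;
  };

  const auto start = clock::now();
  queue_for_recovery(0);

  while (!recovery_wq.empty()) {
    // The recovery worker refuses to do any work before the deadline.
    std::this_thread::sleep_until(defer_recovery_until);
    int pg = recovery_wq.front();
    recovery_wq.pop();
    std::cout << "recovered pg " << pg << "\n";
    // The next PG goes recovery_wait -> recovering and re-enqueues itself,
    // resetting the deadline yet again.
    if (pg + 1 < num_pgs)
      queue_for_recovery(pg + 1);
  }

  const auto total =
      std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - start);
  std::cout << "total wall time: " << total.count() << " ms"
            << " (roughly num_pgs * delay, not a single delay)\n";
  return 0;
}

In the real OSD several PGs can be in 'recovering' at once
(osd_recovery_max_active), but the pattern is the same: each new enqueue
pushes the deadline forward again, so the delay is paid over and over rather
than once at the start.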
2013/11/13 Andrey Korolyov <andrey@xxxxxxx>:
> In the attached file I added two slices of degraded PGs for the first
> example, and they belong to completely different sets of OSDs. I have to
> report that lowering 'osd recovery delay start' to the default 15s value
> increased recovery speed a lot, but the documentation says that it should
> affect only the immediate post-peering behaviour (at least in my
> understanding). I wonder why it affects the regular recovery procedure,
> where there is no place for remapping and the corresponding peering events.
>
> On 11/13/2013 09:51 PM, Gregory Farnum wrote:
>> How did you generate these scenarios? At first glance it looks to me
>> like you've got very low limits set on how many PGs an OSD can be
>> recovering at once, and in the first example they were all targeted to
>> that one OSD, while in the second they were distributed.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Wed, Nov 13, 2013 at 3:00 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>> Hello,
>>>
>>> Using 5c65e1ee3932a021cfd900a74cdc1d43b9103f0f with a large amount of
>>> committed data and a relatively low PG count, I've observed
>>> unexplainably long recovery times for PGs even when the degraded object
>>> count is almost zero:
>>>
>>> 04:44:42.521896 mon.0 [INF] pgmap v24807947: 2048 pgs: 911 active+clean,
>>> 1131 active+recovery_wait, 6 active+recovering; 5389 GB data, 16455 GB
>>> used, 87692 GB / 101 TB avail; 5839KB/s rd, 2986KB/s wr, 567op/s;
>>> 865/4162926 degraded (0.021%); recovering 2 o/s, 9251KB/s
>>>
>>> At this moment we have a freshly restarted cluster and a large number
>>> of PGs in the recovery_wait state; after a couple of minutes the
>>> picture changes a little:
>>>
>>> 2013-11-13 05:30:18.020093 mon.0 [INF] pgmap v24809483: 2048 pgs: 939
>>> active+clean, 1105 active+recovery_wait, 4 active+recovering; 5394 GB
>>> data, 16472 GB used, 87676 GB / 101 TB avail; 1627KB/s rd, 3866KB/s wr,
>>> 1499op/s; 2456/4167201 degraded (0.059%)
>>>
>>> And after a couple of hours we reach a peak in degraded objects, with
>>> PGs still moving to active+clean:
>>>
>>> 2013-11-13 10:05:36.245917 mon.0 [INF] pgmap v24816326: 2048 pgs: 1191
>>> active+clean, 854 active+recovery_wait, 3 active+recovering; 5467 GB
>>> data, 16690 GB used, 87457 GB / 101 TB avail; 16339KB/s rd, 18006KB/s
>>> wr, 16025op/s; 23495/4223061 degraded (0.556%)
>>>
>>> After the peak has passed, the object count starts to decrease, and it
>>> seems the cluster will reach a completely clean state in the next ten
>>> hours.
>>>
>>> For comparison, with a PG count ten times higher recovery goes much
>>> faster:
>>>
>>> 2013-11-05 03:20:56.330767 mon.0 [INF] pgmap v24143721: 27648 pgs: 26171
>>> active+clean, 1474 active+recovery_wait, 3 active+recovering; 7855 GB
>>> data, 25609 GB used, 78538 GB / 101 TB avail; 3298KB/s rd, 7746KB/s wr,
>>> 3581op/s; 183/6554634 degraded (0.003%)
>>>
>>> 2013-11-05 04:04:55.779345 mon.0 [INF] pgmap v24145291: 27648 pgs: 27646
>>> active+clean, 1 active+recovery_wait, 1 active+recovering; 7857 GB data,
>>> 25615 GB used, 78533 GB / 101 TB avail; 999KB/s rd, 690KB/s wr, 563op/s
>>>
>>> Recovery and backfill settings were the same during all tests:
>>> "osd_max_backfills": "1",
>>> "osd_backfill_full_ratio": "0.85",
>>> "osd_backfill_retry_interval": "10",
>>> "osd_recovery_threads": "1",
>>> "osd_recover_clone_overlap": "true",
>>> "osd_backfill_scan_min": "64",
>>> "osd_backfill_scan_max": "512",
>>> "osd_recovery_thread_timeout": "30",
>>> "osd_recovery_delay_start": "300",
>>> "osd_recovery_max_active": "5",
>>> "osd_recovery_max_chunk": "8388608",
>>> "osd_recovery_forget_lost_objects": "false",
>>> "osd_kill_backfill_at": "0",
>>> "osd_debug_skip_full_check_in_backfill_reservation": "false",
>>> "osd_recovery_op_priority": "10",
>>>
>>> Also, during recovery some heartbeats may be missed. This is not
>>> related to the current situation but has been observed for a very long
>>> time (for now, the roughly four-second gaps between heartbeats seem to
>>> be distributed almost randomly over time):
>>>
>>> 2013-11-13 14:57:11.316459 mon.0 [INF] pgmap v24826822: 2048 pgs: 1513
>>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>>> data, 16708 GB used, 87440 GB / 101 TB avail; 16098KB/s rd, 4085KB/s wr,
>>> 623op/s; 15670/4227330 degraded (0.371%)
>>> 2013-11-13 14:57:12.328538 mon.0 [INF] pgmap v24826823: 2048 pgs: 1513
>>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>>> data, 16708 GB used, 87440 GB / 101 TB avail; 3806KB/s rd, 3446KB/s wr,
>>> 284op/s; 15670/4227330 degraded (0.371%)
>>> 2013-11-13 14:57:13.336618 mon.0 [INF] pgmap v24826824: 2048 pgs: 1513
>>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>>> data, 16708 GB used, 87440 GB / 101 TB avail; 11051KB/s rd, 12171KB/s
>>> wr, 1470op/s; 15670/4227330 degraded (0.371%)
>>> 2013-11-13 14:57:16.317271 mon.0 [INF] pgmap v24826825: 2048 pgs: 1513
>>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>>> data, 16708 GB used, 87440 GB / 101 TB avail; 3610KB/s rd, 3171KB/s wr,
>>> 1820op/s; 15670/4227330 degraded (0.371%)
>>> 2013-11-13 14:57:17.366554 mon.0 [INF] pgmap v24826826: 2048 pgs: 1513
>>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>>> data, 16708 GB used, 87440 GB / 101 TB avail; 11323KB/s rd, 1759KB/s wr,
>>> 13195op/s; 15670/4227330 degraded (0.371%)
>>> 2013-11-13 14:57:18.379340 mon.0 [INF] pgmap v24826827: 2048 pgs: 1513
>>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>>> data, 16708 GB used, 87440 GB / 101 TB avail; 38113KB/s rd, 7461KB/s wr,
>>> 46511op/s; 15670/4227330 degraded (0.371%)
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com