In the attached file I added two slices of the degraded PGs from the
first example, and they belong to completely different sets of OSDs.
I also have to report that lowering 'osd recovery delay start' to the
default 15s value increased recovery speed a lot, although the
documentation says it should only affect the immediate post-peering
behaviour (at least in my understanding). I wonder why it affects the
regular recovery procedure, where there is no place for remapping and
the corresponding peering events. (A rough sketch of checking and
pushing this value at runtime is below the quoted thread.)

On 11/13/2013 09:51 PM, Gregory Farnum wrote:
> How did you generate these scenarios? At first glance it looks to me
> like you've got very low limits set on how many PGs an OSD can be
> recovering at once, and in the first example they were all targeted to
> that one OSD, while in the second they were distributed.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Wed, Nov 13, 2013 at 3:00 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>> Hello,
>>
>> Using 5c65e1ee3932a021cfd900a74cdc1d43b9103f0f with
>> large amount of data commit and relatively low PG rate,
>> I`ve observed unexplainable long recovery times for PGs
>> even if the degraded object count is almost zero:
>>
>> 04:44:42.521896 mon.0 [INF] pgmap v24807947: 2048 pgs: 911 active+clean,
>> 1131 active+recovery_wait, 6 active+recovering; 5389 GB data, 16455 GB
>> used, 87692 GB / 101 TB avail; 5839KB/s rd, 2986KB/s wr, 567op/s;
>> 865/4162926 degraded (0.021%); recovering 2 o/s, 9251KB/s
>>
>> at this moment we have freshly restarted cluster and large amount of PGs
>> in recovery_wait state; after a couple of minutes picture changes a little:
>>
>> 2013-11-13 05:30:18.020093 mon.0 [INF] pgmap v24809483: 2048 pgs: 939
>> active+clean, 1105 active+recovery_wait, 4 active+recovering; 5394 GB
>> data, 16472 GB used, 87676 GB / 101 TB avail; 1627KB/s rd, 3866KB/s wr,
>> 1499op/s; 2456/4167201 degraded (0.059%)
>>
>> and after a couple of hours we`re reaching a peak by degraded objects,
>> PGs still moving to active+clean:
>>
>> 2013-11-13 10:05:36.245917 mon.0 [INF] pgmap v24816326: 2048 pgs: 1191
>> active+clean, 854 active+recovery_wait, 3 active+recovering; 5467 GB
>> data, 16690 GB used, 87457 GB / 101 TB avail; 16339KB/s rd, 18006KB/s
>> wr, 16025op/s; 23495/4223061 degraded (0.556%)
>>
>> After peak was passed, object count starts to decrease and seems cluster
>> will reach completely clean state in next ten hours.
>>
>> For example, with PG count ten times higher recovery goes way faster:
>>
>> 2013-11-05 03:20:56.330767 mon.0 [INF] pgmap v24143721: 27648 pgs: 26171
>> active+clean, 1474 active+recovery_wait, 3 active+recovering; 7855 GB
>> data, 25609 GB used, 78538 GB / 101 TB avail; 3298KB/s rd, 7746KB/s wr,
>> 3581op/s; 183/6554634 degraded (0.003%)
>>
>> 2013-11-05 04:04:55.779345 mon.0 [INF] pgmap v24145291: 27648 pgs: 27646
>> active+clean, 1 active+recovery_wait, 1 active+recovering; 7857 GB data,
>> 25615 GB used, 78533 GB / 101 TB avail; 999KB/s rd, 690KB/s wr, 563op/s
>>
>> Recovery and backfill settings was the same during all tests:
>> "osd_max_backfills": "1",
>> "osd_backfill_full_ratio": "0.85",
>> "osd_backfill_retry_interval": "10",
>> "osd_recovery_threads": "1",
>> "osd_recover_clone_overlap": "true",
>> "osd_backfill_scan_min": "64",
>> "osd_backfill_scan_max": "512",
>> "osd_recovery_thread_timeout": "30",
>> "osd_recovery_delay_start": "300",
>> "osd_recovery_max_active": "5",
>> "osd_recovery_max_chunk": "8388608",
>> "osd_recovery_forget_lost_objects": "false",
>> "osd_kill_backfill_at": "0",
>> "osd_debug_skip_full_check_in_backfill_reservation": "false",
>> "osd_recovery_op_priority": "10",
>>
>>
>> Also during recovery some heartbeats may miss, it is not related to the
>> current situation but observed for a very long time(for now, seems
>> four-seconds delays between heartbeats distributed almost randomly over
>> a time flow):
>>
>> 2013-11-13 14:57:11.316459 mon.0 [INF] pgmap v24826822: 2048 pgs: 1513
>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>> data, 16708 GB used, 87440 GB / 101 TB avail; 16098KB/s rd, 4085KB/s wr,
>> 623op/s; 15670/4227330 degraded (0.371%)
>> 2013-11-13 14:57:12.328538 mon.0 [INF] pgmap v24826823: 2048 pgs: 1513
>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>> data, 16708 GB used, 87440 GB / 101 TB avail; 3806KB/s rd, 3446KB/s wr,
>> 284op/s; 15670/4227330 degraded (0.371%)
>> 2013-11-13 14:57:13.336618 mon.0 [INF] pgmap v24826824: 2048 pgs: 1513
>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>> data, 16708 GB used, 87440 GB / 101 TB avail; 11051KB/s rd, 12171KB/s
>> wr, 1470op/s; 15670/4227330 degraded (0.371%)
>> 2013-11-13 14:57:16.317271 mon.0 [INF] pgmap v24826825: 2048 pgs: 1513
>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>> data, 16708 GB used, 87440 GB / 101 TB avail; 3610KB/s rd, 3171KB/s wr,
>> 1820op/s; 15670/4227330 degraded (0.371%)
>> 2013-11-13 14:57:17.366554 mon.0 [INF] pgmap v24826826: 2048 pgs: 1513
>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>> data, 16708 GB used, 87440 GB / 101 TB avail; 11323KB/s rd, 1759KB/s wr,
>> 13195op/s; 15670/4227330 degraded (0.371%)
>> 2013-11-13 14:57:18.379340 mon.0 [INF] pgmap v24826827: 2048 pgs: 1513
>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>> data, 16708 GB used, 87440 GB / 101 TB avail; 38113KB/s rd, 7461KB/s wr,
>> 46511op/s; 15670/4227330 degraded (0.371%)
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
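
Regarding the 'osd recovery delay start' change mentioned at the top,
here is a rough sketch of how the value can be checked and pushed at
runtime; the admin socket path and osd.0 are just examples for one
node, and the persistent variant goes into the [osd] section of
ceph.conf:

# current value on a single OSD, via its admin socket (default path shown)
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep osd_recovery_delay_start

# push the 15s value to all running OSDs without restarting them
ceph tell osd.\* injectargs '--osd-recovery-delay-start 15'

# persistent variant, in ceph.conf under [osd]:
#   osd recovery delay start = 15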
Attachment:
pg-dump.txt.gz
Description: GNU Zip compressed data
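
For anyone looking into the attachment: slices like those can be pulled
out of a plain-text 'ceph pg dump' with something along these lines
(just a sketch; it assumes the up and acting sets are printed as
bracketed OSD lists, as this ceph version does):

# only the PGs that are still recovering or waiting for recovery
zcat pg-dump.txt.gz | grep -E 'recovery_wait|recovering'

# count how often each OSD set shows up among those PGs; 'up' and
# 'acting' are both bracketed lists, so every PG row is counted twice
zcat pg-dump.txt.gz | grep -E 'recovery_wait|recovering' \
    | grep -oE '\[[0-9,]+\]' | sort | uniq -c | sort -rn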