Re: Recovery took too long on cuttlefish

On 13.11.2013 20:48, Andrey Korolyov wrote:
In the attached file I added two slices of degraded PGs for the first
example, and they belong to completely different sets of OSDs. I should
also report that lowering 'osd recovery delay start' to its default
value of 15s increased recovery speed a lot, even though the
documentation says it should affect only immediate post-peering
behaviour (at least in my understanding). I wonder why it affects the
regular recovery procedure, where there is no remapping and no
corresponding peering events.
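
For reference, pinning the option back to the 15s default mentioned
above would look like this in ceph.conf; a minimal sketch only, and the
OSDs need a restart (or a runtime injection, as sketched further down
the thread) for it to take effect:

        [osd]
                osd_recovery_delay_start = 15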

I ran into the same thing; try these settings under the [osd] section:

        osd_recover_clone_overlap = false

        filestore_max_sync_interval = 15
        filestore_queue_max_ops = 500
        filestore_queue_committing_max_ops = 5000
        filestore_queue_max_bytes = 419430400
        filestore_queue_committing_max_bytes = 419430400

        filestore_wbthrottle_xfs_bytes_start_flusher = 125829120
        filestore_wbthrottle_xfs_bytes_hard_limit = 419430400
        filestore_wbthrottle_xfs_ios_start_flusher = 5000
        filestore_wbthrottle_xfs_ios_hard_limit = 50000
        filestore_wbthrottle_xfs_inodes_start_flusher = 1000
        filestore_wbthrottle_xfs_inodes_hard_limit = 10000
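
For readability, the byte values above work out to round MiB figures; a
trivial shell check, nothing Ceph-specific:

        echo $(( 125829120 / 1024 / 1024 ))   # bytes_start_flusher: 120 MiB
        echo $(( 419430400 / 1024 / 1024 ))   # bytes_hard_limit and the queue byte limits: 400 MiB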

On 11/13/2013 09:51 PM, Gregory Farnum wrote:
How did you generate these scenarios? At first glance it looks to me
like you've got very low limits set on how many PGs an OSD can be
recovering at once, and in the first example they were all targeted to
that one OSD, while in the second they were distributed.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
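
The limits Greg is most likely referring to are osd_recovery_max_active
and osd_max_backfills, both of which appear in the settings dump further
down. A sketch of raising them at runtime, where the OSD ids and the
values are purely illustrative:

        # apply to osd.0 .. osd.9 without a restart; pick values suited to your hardware
        for id in $(seq 0 9); do
            ceph tell osd.$id injectargs '--osd_recovery_max_active 15 --osd_max_backfills 4'
        done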


On Wed, Nov 13, 2013 at 3:00 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
Hello,

Using 5c65e1ee3932a021cfd900a74cdc1d43b9103f0f, with a large amount of
data being committed and a relatively low PG count, I've observed
unexplainably long recovery times for PGs even when the degraded object
count is almost zero:

04:44:42.521896 mon.0 [INF] pgmap v24807947: 2048 pgs: 911 active+clean,
1131 active+recovery_wait, 6 active+recovering; 5389 GB data, 16455 GB
used, 87692 GB / 101 TB avail; 5839KB/s rd, 2986KB/s wr, 567op/s;
865/4162926 degraded (0.021%);  recovering 2 o/s, 9251KB/s

At this moment we have a freshly restarted cluster and a large number of
PGs in the recovery_wait state; after a couple of minutes the picture
changes a little:

2013-11-13 05:30:18.020093 mon.0 [INF] pgmap v24809483: 2048 pgs: 939
active+clean, 1105 active+recovery_wait, 4 active+recovering; 5394 GB
data, 16472 GB used, 87676 GB / 101 TB avail; 1627KB/s rd, 3866KB/s wr,
1499op/s; 2456/4167201 degraded (0.059%)

After a couple of hours we reach the peak degraded object count, with
PGs still moving to active+clean:

2013-11-13 10:05:36.245917 mon.0 [INF] pgmap v24816326: 2048 pgs: 1191
active+clean, 854 active+recovery_wait, 3 active+recovering; 5467 GB
data, 16690 GB used, 87457 GB / 101 TB avail; 16339KB/s rd, 18006KB/s
wr, 16025op/s; 23495/4223061 degraded (0.556%)

After the peak is passed, the degraded object count starts to decrease,
and it seems the cluster will reach a completely clean state within the
next ten hours.

For comparison, with a PG count ten times higher, recovery goes much faster:

2013-11-05 03:20:56.330767 mon.0 [INF] pgmap v24143721: 27648 pgs: 26171
active+clean, 1474 active+recovery_wait, 3 active+recovering; 7855 GB
data, 25609 GB used, 78538 GB / 101 TB avail; 3298KB/s rd, 7746KB/s wr,
3581op/s; 183/6554634 degraded (0.003%)

2013-11-05 04:04:55.779345 mon.0 [INF] pgmap v24145291: 27648 pgs: 27646
active+clean, 1 active+recovery_wait, 1 active+recovering; 7857 GB data,
25615 GB used, 78533 GB / 101 TB avail; 999KB/s rd, 690KB/s wr, 563op/s
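
For anyone wanting to reproduce the higher-PG-count setup, PG counts are
raised per pool; a sketch with a hypothetical pool name and target,
bearing in mind that pg_num can only be increased and that splitting PGs
generates load of its own:

        # the pool name "rbd" and the target of 4096 are illustrative only
        ceph osd pool set rbd pg_num 4096
        # once the new PGs exist, let them be used for placement as well
        ceph osd pool set rbd pgp_num 4096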

Recovery and backfill settings were the same during all tests (a way to
pull this dump from a running OSD is sketched after the list):
   "osd_max_backfills": "1",
   "osd_backfill_full_ratio": "0.85",
   "osd_backfill_retry_interval": "10",
   "osd_recovery_threads": "1",
   "osd_recover_clone_overlap": "true",
   "osd_backfill_scan_min": "64",
   "osd_backfill_scan_max": "512",
   "osd_recovery_thread_timeout": "30",
   "osd_recovery_delay_start": "300",
   "osd_recovery_max_active": "5",
   "osd_recovery_max_chunk": "8388608",
   "osd_recovery_forget_lost_objects": "false",
   "osd_kill_backfill_at": "0",
   "osd_debug_skip_full_check_in_backfill_reservation": "false",
   "osd_recovery_op_priority": "10",


Also, during recovery some heartbeats may be missed. This is not related
to the current situation, but it has been observed for a very long time
(for now it looks like four-second delays between heartbeats, distributed
almost randomly over time; a quick way to measure the gaps is sketched
after the excerpt below):

2013-11-13 14:57:11.316459 mon.0 [INF] pgmap v24826822: 2048 pgs: 1513
active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
data, 16708 GB used, 87440 GB / 101 TB avail; 16098KB/s rd, 4085KB/s wr,
623op/s; 15670/4227330 degraded (0.371%)
2013-11-13 14:57:12.328538 mon.0 [INF] pgmap v24826823: 2048 pgs: 1513
active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
data, 16708 GB used, 87440 GB / 101 TB avail; 3806KB/s rd, 3446KB/s wr,
284op/s; 15670/4227330 degraded (0.371%)
2013-11-13 14:57:13.336618 mon.0 [INF] pgmap v24826824: 2048 pgs: 1513
active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
data, 16708 GB used, 87440 GB / 101 TB avail; 11051KB/s rd, 12171KB/s
wr, 1470op/s; 15670/4227330 degraded (0.371%)
2013-11-13 14:57:16.317271 mon.0 [INF] pgmap v24826825: 2048 pgs: 1513
active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
data, 16708 GB used, 87440 GB / 101 TB avail; 3610KB/s rd, 3171KB/s wr,
1820op/s; 15670/4227330 degraded (0.371%)
2013-11-13 14:57:17.366554 mon.0 [INF] pgmap v24826826: 2048 pgs: 1513
active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
data, 16708 GB used, 87440 GB / 101 TB avail; 11323KB/s rd, 1759KB/s wr,
13195op/s; 15670/4227330 degraded (0.371%)
2013-11-13 14:57:18.379340 mon.0 [INF] pgmap v24826827: 2048 pgs: 1513
active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
data, 16708 GB used, 87440 GB / 101 TB avail; 38113KB/s rd, 7461KB/s wr,
46511op/s; 15670/4227330 degraded (0.371%)
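
To put a number on those gaps, the pgmap timestamps above can be diffed
directly; a rough sketch, assuming the excerpt is saved as pgmap.log
(the 14:57:13 -> 14:57:16 jump shows up as a 3s gap, for example):

        # print the gap, in whole seconds, between consecutive pgmap updates
        awk '{ split($2, t, "[:.]"); s = t[1]*3600 + t[2]*60 + t[3];
               if (prev) print s - prev "s"; prev = s }' pgmap.log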
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


