Please correct me if I'm wrong. The recovery delay (osd_recovery_delay_start)
was originally supposed to kick in only during peering, when recovery_wq is
enqueued with the placement groups scheduled for recovery:

RecoveryWQ::_enqueue(...)
    ...
    osd->defer_recovery_until = ceph_clock_now(g_ceph_context);
    osd->defer_recovery_until += g_conf->osd_recovery_delay_start;
    ...

But in fact recovery_wq is enqueued every time any PG passes from the
'recovery_wait' to the 'recovering' state:

PG::RecoveryState::Recovering::Recovering(my_context ctx)
  : my_base(ctx)
{
  state_name = "Started/Primary/Active/Recovering";
  context< RecoveryMachine >().log_enter(state_name);
  PG *pg = context< RecoveryMachine >().pg;
  pg->state_clear(PG_STATE_RECOVERY_WAIT);
  pg->state_set(PG_STATE_RECOVERING);
  pg->osd->queue_for_recovery(pg);
}

This means that the recovery of almost every PG is followed by a break of
'osd_recovery_delay_start' seconds. So it would be nice to correct either the
behaviour or the name/description of this config option, because it is
currently quite misleading: instead of the expected one-time initial delay,
it slows down the entire recovery process.
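To illustrate the cumulative effect, here is a small standalone sketch. It is
not Ceph code, just a toy model with made-up names and values, built on the
assumption (which the snippets above suggest) that every queue_for_recovery()
resets defer_recovery_until to now + osd_recovery_delay_start and that the
recovery worker does nothing before that deadline. With PGs moving through
'recovery_wait' -> 'recovering' one after another, the total added wait grows
roughly as (number of PGs) * osd_recovery_delay_start rather than the single
initial delay one would expect:

#include <chrono>
#include <iostream>
#include <queue>
#include <thread>

int main() {
  using clock = std::chrono::steady_clock;

  // Stand-ins for osd_recovery_delay_start and the number of PGs waiting
  // in recovery_wait; both values are made up for the demo.
  const auto delay = std::chrono::milliseconds(100);
  const int num_pgs = 5;

  std::queue<int> recovery_wq;
  clock::time_point defer_recovery_until = clock::now();

  // Model of queue_for_recovery()/_enqueue(): every call pushes the
  // deferral deadline forward to now + delay, as in the snippet above.
  auto queue_for_recovery = [&](int pg) {
    recovery_wq.push(pg);
    defer_recovery_until = clock::now() + delay;
  };

  const auto start = clock::now();
  queue_for_recovery(0);

  while (!recovery_wq.empty()) {
    // The recovery worker refuses to do any work before the deadline.
    std::this_thread::sleep_until(defer_recovery_until);
    int pg = recovery_wq.front();
    recovery_wq.pop();
    std::cout << "recovered pg " << pg << "\n";
    // The next PG goes recovery_wait -> recovering and re-enqueues itself,
    // resetting the deadline yet again.
    if (pg + 1 < num_pgs)
      queue_for_recovery(pg + 1);
  }

  const auto total =
      std::chrono::duration_cast<std::chrono::milliseconds>(clock::now() - start);
  std::cout << "total wall time: " << total.count() << " ms"
            << " (roughly num_pgs * delay, not a single delay)\n";
  return 0;
}

In the real OSD several PGs can be in 'recovering' at once
(osd_recovery_max_active), but the pattern is the same: each new enqueue
pushes the deadline forward again, so the delay is paid over and over rather
than once at the start.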
2013/11/13 Andrey Korolyov <andrey@xxxxxxx>:
> In the attached file I added two slices of degraded PGs for the first
> example, and they belong to completely different sets of OSDs. I have to
> report that lowering 'osd recovery delay start' to the default 15s value
> increased recovery speed a lot, but the documentation says that it should
> affect only the immediate post-peering behaviour (at least in my
> understanding). I wonder why it affects the regular recovery procedure,
> where there is no place for remapping and the corresponding peering events.
>
> On 11/13/2013 09:51 PM, Gregory Farnum wrote:
>> How did you generate these scenarios? At first glance it looks to me
>> like you've got very low limits set on how many PGs an OSD can be
>> recovering at once, and in the first example they were all targeted to
>> that one OSD, while in the second they were distributed.
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>>
>> On Wed, Nov 13, 2013 at 3:00 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>>> Hello,
>>>
>>> Using 5c65e1ee3932a021cfd900a74cdc1d43b9103f0f with a large amount of
>>> committed data and a relatively low PG count, I've observed
>>> unexplainably long recovery times for PGs even when the degraded object
>>> count is almost zero:
>>>
>>> 04:44:42.521896 mon.0 [INF] pgmap v24807947: 2048 pgs: 911 active+clean,
>>> 1131 active+recovery_wait, 6 active+recovering; 5389 GB data, 16455 GB
>>> used, 87692 GB / 101 TB avail; 5839KB/s rd, 2986KB/s wr, 567op/s;
>>> 865/4162926 degraded (0.021%); recovering 2 o/s, 9251KB/s
>>>
>>> At this moment we have a freshly restarted cluster and a large number
>>> of PGs in the recovery_wait state; after a couple of minutes the
>>> picture changes a little:
>>>
>>> 2013-11-13 05:30:18.020093 mon.0 [INF] pgmap v24809483: 2048 pgs: 939
>>> active+clean, 1105 active+recovery_wait, 4 active+recovering; 5394 GB
>>> data, 16472 GB used, 87676 GB / 101 TB avail; 1627KB/s rd, 3866KB/s wr,
>>> 1499op/s; 2456/4167201 degraded (0.059%)
>>>
>>> And after a couple of hours we reach a peak in degraded objects, with
>>> PGs still moving to active+clean:
>>>
>>> 2013-11-13 10:05:36.245917 mon.0 [INF] pgmap v24816326: 2048 pgs: 1191
>>> active+clean, 854 active+recovery_wait, 3 active+recovering; 5467 GB
>>> data, 16690 GB used, 87457 GB / 101 TB avail; 16339KB/s rd, 18006KB/s
>>> wr, 16025op/s; 23495/4223061 degraded (0.556%)
>>>
>>> After the peak has passed, the object count starts to decrease, and it
>>> seems the cluster will reach a completely clean state in the next ten
>>> hours.
>>>
>>> For comparison, with a PG count ten times higher recovery goes much
>>> faster:
>>>
>>> 2013-11-05 03:20:56.330767 mon.0 [INF] pgmap v24143721: 27648 pgs: 26171
>>> active+clean, 1474 active+recovery_wait, 3 active+recovering; 7855 GB
>>> data, 25609 GB used, 78538 GB / 101 TB avail; 3298KB/s rd, 7746KB/s wr,
>>> 3581op/s; 183/6554634 degraded (0.003%)
>>>
>>> 2013-11-05 04:04:55.779345 mon.0 [INF] pgmap v24145291: 27648 pgs: 27646
>>> active+clean, 1 active+recovery_wait, 1 active+recovering; 7857 GB data,
>>> 25615 GB used, 78533 GB / 101 TB avail; 999KB/s rd, 690KB/s wr, 563op/s
>>>
>>> Recovery and backfill settings were the same during all tests:
>>> "osd_max_backfills": "1",
>>> "osd_backfill_full_ratio": "0.85",
>>> "osd_backfill_retry_interval": "10",
>>> "osd_recovery_threads": "1",
>>> "osd_recover_clone_overlap": "true",
>>> "osd_backfill_scan_min": "64",
>>> "osd_backfill_scan_max": "512",
>>> "osd_recovery_thread_timeout": "30",
>>> "osd_recovery_delay_start": "300",
>>> "osd_recovery_max_active": "5",
>>> "osd_recovery_max_chunk": "8388608",
>>> "osd_recovery_forget_lost_objects": "false",
>>> "osd_kill_backfill_at": "0",
>>> "osd_debug_skip_full_check_in_backfill_reservation": "false",
>>> "osd_recovery_op_priority": "10",
>>>
>>> Also, during recovery some heartbeats may be missed. This is not
>>> related to the current situation but has been observed for a very long
>>> time (for now, the roughly four-second gaps between heartbeats seem to
>>> be distributed almost randomly over time):
>>>
>>> 2013-11-13 14:57:11.316459 mon.0 [INF] pgmap v24826822: 2048 pgs: 1513
>>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>>> data, 16708 GB used, 87440 GB / 101 TB avail; 16098KB/s rd, 4085KB/s wr,
>>> 623op/s; 15670/4227330 degraded (0.371%)
>>> 2013-11-13 14:57:12.328538 mon.0 [INF] pgmap v24826823: 2048 pgs: 1513
>>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>>> data, 16708 GB used, 87440 GB / 101 TB avail; 3806KB/s rd, 3446KB/s wr,
>>> 284op/s; 15670/4227330 degraded (0.371%)
>>> 2013-11-13 14:57:13.336618 mon.0 [INF] pgmap v24826824: 2048 pgs: 1513
>>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>>> data, 16708 GB used, 87440 GB / 101 TB avail; 11051KB/s rd, 12171KB/s
>>> wr, 1470op/s; 15670/4227330 degraded (0.371%)
>>> 2013-11-13 14:57:16.317271 mon.0 [INF] pgmap v24826825: 2048 pgs: 1513
>>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>>> data, 16708 GB used, 87440 GB / 101 TB avail; 3610KB/s rd, 3171KB/s wr,
>>> 1820op/s; 15670/4227330 degraded (0.371%)
>>> 2013-11-13 14:57:17.366554 mon.0 [INF] pgmap v24826826: 2048 pgs: 1513
>>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>>> data, 16708 GB used, 87440 GB / 101 TB avail; 11323KB/s rd, 1759KB/s wr,
>>> 13195op/s; 15670/4227330 degraded (0.371%)
>>> 2013-11-13 14:57:18.379340 mon.0 [INF] pgmap v24826827: 2048 pgs: 1513
>>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>>> data, 16708 GB used, 87440 GB / 101 TB avail; 38113KB/s rd, 7461KB/s wr,
>>> 46511op/s; 15670/4227330 degraded (0.371%)
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com