In the attached file I added two slices of the degraded PGs from the
first example, and they belong to completely different sets of OSDs.
I also have to report that lowering 'osd recovery delay start' to the
default 15s value increased recovery speed a lot, although the
documentation says it should only affect the immediate post-peering
behaviour (at least in my understanding). I wonder why it affects the
regular recovery procedure, where there is no place for remapping and
the corresponding peering events. (A rough sketch of checking and
pushing this value at runtime is below the quoted thread.)

On 11/13/2013 09:51 PM, Gregory Farnum wrote:
> How did you generate these scenarios? At first glance it looks to me
> like you've got very low limits set on how many PGs an OSD can be
> recovering at once, and in the first example they were all targeted to
> that one OSD, while in the second they were distributed.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Wed, Nov 13, 2013 at 3:00 AM, Andrey Korolyov <andrey@xxxxxxx> wrote:
>> Hello,
>>
>> Using 5c65e1ee3932a021cfd900a74cdc1d43b9103f0f with
>> large amount of data commit and relatively low PG rate,
>> I`ve observed unexplainable long recovery times for PGs
>> even if the degraded object count is almost zero:
>>
>> 04:44:42.521896 mon.0 [INF] pgmap v24807947: 2048 pgs: 911 active+clean,
>> 1131 active+recovery_wait, 6 active+recovering; 5389 GB data, 16455 GB
>> used, 87692 GB / 101 TB avail; 5839KB/s rd, 2986KB/s wr, 567op/s;
>> 865/4162926 degraded (0.021%); recovering 2 o/s, 9251KB/s
>>
>> at this moment we have freshly restarted cluster and large amount of PGs
>> in recovery_wait state; after a couple of minutes picture changes a little:
>>
>> 2013-11-13 05:30:18.020093 mon.0 [INF] pgmap v24809483: 2048 pgs: 939
>> active+clean, 1105 active+recovery_wait, 4 active+recovering; 5394 GB
>> data, 16472 GB used, 87676 GB / 101 TB avail; 1627KB/s rd, 3866KB/s wr,
>> 1499op/s; 2456/4167201 degraded (0.059%)
>>
>> and after a couple of hours we`re reaching a peak by degraded objects,
>> PGs still moving to active+clean:
>>
>> 2013-11-13 10:05:36.245917 mon.0 [INF] pgmap v24816326: 2048 pgs: 1191
>> active+clean, 854 active+recovery_wait, 3 active+recovering; 5467 GB
>> data, 16690 GB used, 87457 GB / 101 TB avail; 16339KB/s rd, 18006KB/s
>> wr, 16025op/s; 23495/4223061 degraded (0.556%)
>>
>> After peak was passed, object count starts to decrease and seems cluster
>> will reach completely clean state in next ten hours.
>>
>> For example, with PG count ten times higher recovery goes way faster:
>>
>> 2013-11-05 03:20:56.330767 mon.0 [INF] pgmap v24143721: 27648 pgs: 26171
>> active+clean, 1474 active+recovery_wait, 3 active+recovering; 7855 GB
>> data, 25609 GB used, 78538 GB / 101 TB avail; 3298KB/s rd, 7746KB/s wr,
>> 3581op/s; 183/6554634 degraded (0.003%)
>>
>> 2013-11-05 04:04:55.779345 mon.0 [INF] pgmap v24145291: 27648 pgs: 27646
>> active+clean, 1 active+recovery_wait, 1 active+recovering; 7857 GB data,
>> 25615 GB used, 78533 GB / 101 TB avail; 999KB/s rd, 690KB/s wr, 563op/s
>>
>> Recovery and backfill settings was the same during all tests:
>> "osd_max_backfills": "1",
>> "osd_backfill_full_ratio": "0.85",
>> "osd_backfill_retry_interval": "10",
>> "osd_recovery_threads": "1",
>> "osd_recover_clone_overlap": "true",
>> "osd_backfill_scan_min": "64",
>> "osd_backfill_scan_max": "512",
>> "osd_recovery_thread_timeout": "30",
>> "osd_recovery_delay_start": "300",
>> "osd_recovery_max_active": "5",
>> "osd_recovery_max_chunk": "8388608",
>> "osd_recovery_forget_lost_objects": "false",
>> "osd_kill_backfill_at": "0",
>> "osd_debug_skip_full_check_in_backfill_reservation": "false",
>> "osd_recovery_op_priority": "10",
>>
>>
>> Also during recovery some heartbeats may miss, it is not related to the
>> current situation but observed for a very long time(for now, seems
>> four-seconds delays between heartbeats distributed almost randomly over
>> a time flow):
>>
>> 2013-11-13 14:57:11.316459 mon.0 [INF] pgmap v24826822: 2048 pgs: 1513
>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>> data, 16708 GB used, 87440 GB / 101 TB avail; 16098KB/s rd, 4085KB/s wr,
>> 623op/s; 15670/4227330 degraded (0.371%)
>> 2013-11-13 14:57:12.328538 mon.0 [INF] pgmap v24826823: 2048 pgs: 1513
>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>> data, 16708 GB used, 87440 GB / 101 TB avail; 3806KB/s rd, 3446KB/s wr,
>> 284op/s; 15670/4227330 degraded (0.371%)
>> 2013-11-13 14:57:13.336618 mon.0 [INF] pgmap v24826824: 2048 pgs: 1513
>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>> data, 16708 GB used, 87440 GB / 101 TB avail; 11051KB/s rd, 12171KB/s
>> wr, 1470op/s; 15670/4227330 degraded (0.371%)
>> 2013-11-13 14:57:16.317271 mon.0 [INF] pgmap v24826825: 2048 pgs: 1513
>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>> data, 16708 GB used, 87440 GB / 101 TB avail; 3610KB/s rd, 3171KB/s wr,
>> 1820op/s; 15670/4227330 degraded (0.371%)
>> 2013-11-13 14:57:17.366554 mon.0 [INF] pgmap v24826826: 2048 pgs: 1513
>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>> data, 16708 GB used, 87440 GB / 101 TB avail; 11323KB/s rd, 1759KB/s wr,
>> 13195op/s; 15670/4227330 degraded (0.371%)
>> 2013-11-13 14:57:18.379340 mon.0 [INF] pgmap v24826827: 2048 pgs: 1513
>> active+clean, 529 active+recovery_wait, 6 active+recovering; 5469 GB
>> data, 16708 GB used, 87440 GB / 101 TB avail; 38113KB/s rd, 7461KB/s wr,
>> 46511op/s; 15670/4227330 degraded (0.371%)
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
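
Regarding the 'osd recovery delay start' change mentioned at the top,
here is a rough sketch of how the value can be checked and pushed at
runtime; the admin socket path and osd.0 are just examples for one
node, and the persistent variant goes into the [osd] section of
ceph.conf:

# current value on a single OSD, via its admin socket (default path shown)
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep osd_recovery_delay_start

# push the 15s value to all running OSDs without restarting them
ceph tell osd.\* injectargs '--osd-recovery-delay-start 15'

# persistent variant, in ceph.conf under [osd]:
#   osd recovery delay start = 15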
Attachment:
pg-dump.txt.gz
Description: GNU Zip compressed data
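
For anyone looking into the attachment: slices like those can be pulled
out of a plain-text 'ceph pg dump' with something along these lines
(just a sketch; it assumes the up and acting sets are printed as
bracketed OSD lists, as this ceph version does):

# only the PGs that are still recovering or waiting for recovery
zcat pg-dump.txt.gz | grep -E 'recovery_wait|recovering'

# count how often each OSD set shows up among those PGs; 'up' and
# 'acting' are both bracketed lists, so every PG row is counted twice
zcat pg-dump.txt.gz | grep -E 'recovery_wait|recovering' \
    | grep -oE '\[[0-9,]+\]' | sort | uniq -c | sort -rn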