It's osd_recover_clone_overlap (see http://tracker.ceph.com/issues/5401).

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Samuel Just
Sent: Wednesday, August 21, 2013 17:33
To: Mike Dawson
Cc: Stefan Priebe - Profihost AG; josh.durgin@xxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: still recovery issues with cuttlefish

Have you tried setting osd_recovery_clone_overlap to false? That seemed
to help with Stefan's issue.
-Sam
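(For anyone wanting to try this: a minimal sketch of how the flag could
be set, assuming the cuttlefish-era option name osd_recover_clone_overlap
and the injectargs syntax of that release; verify both against your
version before using it.)

    # at runtime, across all OSDs, without a restart
    ceph tell osd.* injectargs '--osd_recover_clone_overlap false'

    # or persistently, in ceph.conf
    [osd]
        osd recover clone overlap = false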
On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.dawson@xxxxxxxxxxxx> wrote:
> Sam/Josh,
>
> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this
> morning, hoping it would improve this situation, but there was no
> appreciable change.
>
> One node in our cluster fsck'ed after a reboot and got a bit behind.
> Our instances backed by RBD volumes were OK at that point, but once
> the node booted fully and the OSDs started, all Windows instances
> with RBD volumes experienced very choppy performance and were unable
> to ingest video surveillance traffic and commit it to disk. Once the
> cluster got back to HEALTH_OK, they resumed normal operation.
>
> I tried for a time with conservative recovery settings (osd max
> backfills = 1, osd recovery op priority = 1, and osd recovery max
> active = 1). That brought no improvement for the guests, so I went
> to more aggressive settings to get things moving faster. That
> decreased the duration of the outage.
>
> During the entire period of recovery/backfill, the network looked
> fine, nowhere close to saturation. iowait on all drives looked fine
> as well.
>
> Any ideas?
>
> Thanks,
> Mike Dawson
>
>
> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>
>> The same problem still occurs. I will need to check when I have
>> time to gather logs again.
>>
>> On 14.08.2013 at 01:11, Samuel Just wrote:
>>>
>>> I'm not sure, but your logs did show that you had more than 16
>>> recovery ops in flight, so it's worth a try. If it doesn't help,
>>> collect the same set of logs and I'll look again. Also, there are
>>> a few other patches between 0.61.7 and current cuttlefish which
>>> may help.
>>> -Sam
>>>
>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG
>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>
>>>> On 13.08.2013 at 22:43, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>
>>>>> I just backported a couple of patches from next to fix a bug
>>>>> where we weren't respecting the osd_recovery_max_active config
>>>>> in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You
>>>>> can either try the current cuttlefish branch or wait for a
>>>>> 0.61.8 release.
>>>>
>>>> Thanks! Are you sure that this is the issue? I don't believe it
>>>> is, but I'll give it a try. I already tested a branch from Sage
>>>> where he fixed a race regarding max active some weeks ago, so
>>>> active recovery was capped at 1, but the issue didn't go away.
>>>>
>>>> Stefan
>>>>
>>>>> -Sam
>>>>>
>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just
>>>>> <sam.just@xxxxxxxxxxx> wrote:
>>>>>>
>>>>>> I got swamped today. I should be able to look tomorrow. Sorry!
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG
>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Did you take a look?
>>>>>>>
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 11.08.2013 at 05:50, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> Great! I'll take a look on Monday.
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi Samuel,
>>>>>>>>>
>>>>>>>>> On 09.08.2013 at 23:44, Samuel Just wrote:
>>>>>>>>>
>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>
>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>
>>>>>>>>>> debug osd = 20
>>>>>>>>>> debug filestore = 20
>>>>>>>>>> debug ms = 1
>>>>>>>>>> debug optracker = 20
>>>>>>>>>>
>>>>>>>>>> on a few osds (including the restarted osd), and upload
>>>>>>>>>> those osd logs along with the ceph.log from before killing
>>>>>>>>>> the osd until after the cluster becomes clean again?
>>>>>>>>>
>>>>>>>>> Done; you'll find the logs in the cephdrop folder:
>>>>>>>>> slow_requests_recovering_cuttlefish
>>>>>>>>>
>>>>>>>>> osd.52 was the one recovering
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
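(For readers tuning the same knobs: the recovery throttles Mike mentions
and the debug settings Sam asked for, collected as a ceph.conf sketch.
The values are the ones quoted in this thread, not general
recommendations, and the option names are the cuttlefish-era ones;
verify against your version.)

    [osd]
        # conservative recovery throttling (Mike's first attempt)
        osd max backfills = 1
        osd recovery op priority = 1
        osd recovery max active = 1

        # verbose logging for the recovery traces Sam requested
        debug osd = 20
        debug filestore = 20
        debug ms = 1
        debug optracker = 20

    # the same throttles can be injected at runtime, e.g.:
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-op-priority 1 --osd-recovery-max-active 1'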