Ah, thanks for the correction.
-Sam

On Wed, Aug 21, 2013 at 9:25 AM, Yann ROBIN <yann.robin@xxxxxxxxxxxxx> wrote:
> It's osd recover clone overlap (see http://tracker.ceph.com/issues/5401)
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Samuel Just
> Sent: mercredi 21 août 2013 17:33
> To: Mike Dawson
> Cc: Stefan Priebe - Profihost AG; josh.durgin@xxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: still recovery issues with cuttlefish
>
> Have you tried setting osd_recovery_clone_overlap to false? That seemed to help with Stefan's issue.
> -Sam
>
> On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.dawson@xxxxxxxxxxxx> wrote:
>> Sam/Josh,
>>
>> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this morning, hoping it would improve this situation, but there was no appreciable change.
>>
>> One node in our cluster fsck'ed after a reboot and got a bit behind. Our instances backed by RBD volumes were OK at that point, but once the node booted fully and the OSDs started, all Windows instances with rbd volumes experienced very choppy performance and were unable to ingest video surveillance traffic and commit it to disk. Once the cluster got back to HEALTH_OK, they resumed normal operation.
>>
>> I tried for a time with conservative recovery settings (osd max backfills = 1, osd recovery op priority = 1, and osd recovery max active = 1). No improvement for the guests. So I went to more aggressive settings to get things moving faster. That decreased the duration of the outage.
>>
>> During the entire period of recovery/backfill, the network looked fine... nowhere close to saturation. iowait on all drives looked fine as well.
>>
>> Any ideas?
>>
>> Thanks,
>> Mike Dawson
>>
>> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>>
>>> The same problem still occurs. I will need to check when I have time to gather logs again.
>>>
>>> Am 14.08.2013 01:11, schrieb Samuel Just:
>>>>
>>>> I'm not sure, but your logs did show that you had >16 recovery ops in flight, so it's worth a try. If it doesn't help, collect the same set of logs and I'll look again. Also, there are a few other patches between 61.7 and current cuttlefish which may help.
>>>> -Sam
>>>>
>>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>
>>>>> Am 13.08.2013 um 22:43 schrieb Samuel Just <sam.just@xxxxxxxxxxx>:
>>>>>
>>>>>> I just backported a couple of patches from next to fix a bug where we weren't respecting the osd_recovery_max_active config in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the current cuttlefish branch or wait for a 61.8 release.
>>>>>
>>>>> Thanks! Are you sure that this is the issue? I don't believe it, but I'll give it a try. I already tested a branch from Sage where he fixed a race regarding max active some weeks ago, so active recovery was capped at 1, but the issue didn't go away.
>>>>>
>>>>> Stefan
>>>>>
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> I got swamped today. I should be able to look tomorrow. Sorry!
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Did you take a look?
>>>>>>>>
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> Am 11.08.2013 um 05:50 schrieb Samuel Just <sam.just@xxxxxxxxxxx>:
>>>>>>>>
>>>>>>>>> Great! I'll take a look on Monday.
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Samuel,
>>>>>>>>>>
>>>>>>>>>> Am 09.08.2013 23:44, schrieb Samuel Just:
>>>>>>>>>>
>>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>>
>>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>>
>>>>>>>>>>> debug osd = 20
>>>>>>>>>>> debug filestore = 20
>>>>>>>>>>> debug ms = 1
>>>>>>>>>>> debug optracker = 20
>>>>>>>>>>>
>>>>>>>>>>> on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again?
>>>>>>>>>>
>>>>>>>>>> Done - you'll find the logs in the cephdrop folder: slow_requests_recovering_cuttlefish
>>>>>>>>>>
>>>>>>>>>> osd.52 was the one recovering.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> Greets,
>>>>>>>>>> Stefan
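For reference, the recovery-throttling options discussed above (Mike's conservative settings, plus the clone-overlap option Sam and Yann mention) live in the [osd] section of ceph.conf. A minimal sketch, assuming a cuttlefish/dumpling-era cluster; the option names are taken from the thread, so verify them against the release in use:

    [osd]
        # keep recovery/backfill from starving client I/O
        osd max backfills = 1
        osd recovery max active = 1
        osd recovery op priority = 1
        # option named by Yann above; see http://tracker.ceph.com/issues/5401
        osd recover clone overlap = false

The same values can usually be injected into running OSDs without a restart, e.g. ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1' (dashes and underscores in option names are interchangeable).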
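And a sketch of enabling the debug logging Sam asks for on a few OSDs at runtime (osd.52 is the recovering OSD from Stefan's report; the injectargs invocation and the default log paths are assumptions on the editor's part, not something spelled out in the thread):

    # raise logging on the OSDs under examination, e.g. osd.52
    ceph tell osd.52 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1 --debug-optracker 20'

    # reproduce the recovery problem, then collect (default locations):
    #   /var/log/ceph/ceph-osd.52.log      per-OSD log
    #   /var/log/ceph/ceph.log             cluster log, on the monitor host

    # drop back to normal verbosity afterwards
    ceph tell osd.52 injectargs '--debug-osd 0 --debug-filestore 0 --debug-ms 0 --debug-optracker 0'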