Have you tried setting osd_recovery_clone_overlap to false? That seemed to
help with Stefan's issue.
-Sam

On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.dawson@xxxxxxxxxxxx> wrote:
> Sam/Josh,
>
> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this morning,
> hoping it would improve this situation, but there was no appreciable change.
>
> One node in our cluster fsck'ed after a reboot and got a bit behind. Our
> instances backed by RBD volumes were OK at that point, but once the node
> booted fully and the OSDs started, all Windows instances with RBD volumes
> experienced very choppy performance and were unable to ingest video
> surveillance traffic and commit it to disk. Once the cluster got back to
> HEALTH_OK, they resumed normal operation.
>
> I tried for a time with conservative recovery settings (osd max backfills =
> 1, osd recovery op priority = 1, and osd recovery max active = 1). There was
> no improvement for the guests, so I went to more aggressive settings to get
> things moving faster. That decreased the duration of the outage.
>
> During the entire period of recovery/backfill, the network looked fine...
> nowhere close to saturation. iowait on all drives looked fine as well.
>
> Any ideas?
>
> Thanks,
> Mike Dawson
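For reference, a minimal sketch of how the settings discussed above could be
expressed in ceph.conf and injected at runtime on a cuttlefish/dumpling-era
cluster. The three recovery throttles are named exactly as Mike lists them;
the clone-overlap option is spelled as Sam wrote it and may appear as
"osd recover clone overlap" in your release's config reference, so verify the
exact name before relying on it.

    [osd]
        # throttle recovery/backfill impact (Mike's conservative values)
        osd max backfills = 1
        osd recovery op priority = 1
        osd recovery max active = 1
        # Sam's suggestion; check the exact option name for your release
        osd recovery clone overlap = false

The throttles can also be pushed into running OSDs without a restart:

    ceph tell osd.\* injectargs '--osd-max-backfills 1 --osd-recovery-op-priority 1 --osd-recovery-max-active 1'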
> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>
>> The same problem still occurs. I will need to check when I've time to
>> gather logs again.
>>
>> On 14.08.2013 01:11, Samuel Just wrote:
>>>
>>> I'm not sure, but your logs did show that you had >16 recovery ops in
>>> flight, so it's worth a try. If it doesn't help, collect the same set
>>> of logs and I'll look again. Also, there are a few other patches
>>> between 0.61.7 and current cuttlefish which may help.
>>> -Sam
>>>
>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG
>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>
>>>> On 13.08.2013 at 22:43, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>
>>>>> I just backported a couple of patches from next to fix a bug where we
>>>>> weren't respecting the osd_recovery_max_active config in some cases
>>>>> (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the
>>>>> current cuttlefish branch or wait for a 0.61.8 release.
>>>>
>>>> Thanks! Are you sure that this is the issue? I don't believe it is, but
>>>> I'll give it a try. I already tested a branch from Sage where he fixed
>>>> a race regarding max active some weeks ago, so active recovery was
>>>> capped at 1, but the issue didn't go away.
>>>>
>>>> Stefan
>>>>
>>>>> -Sam
>>>>>
>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just <sam.just@xxxxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> I got swamped today. I should be able to look tomorrow. Sorry!
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG
>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Did you take a look?
>>>>>>>
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 11.08.2013 at 05:50, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> Great! I'll take a look on Monday.
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi Samuel,
>>>>>>>>>
>>>>>>>>> On 09.08.2013 23:44, Samuel Just wrote:
>>>>>>>>>
>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>
>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>
>>>>>>>>>> debug osd = 20
>>>>>>>>>> debug filestore = 20
>>>>>>>>>> debug ms = 1
>>>>>>>>>> debug optracker = 20
>>>>>>>>>>
>>>>>>>>>> on a few osds (including the restarted osd), and upload those osd
>>>>>>>>>> logs along with the ceph.log from before killing the osd until
>>>>>>>>>> after the cluster becomes clean again?
>>>>>>>>>
>>>>>>>>> Done - you'll find the logs in the cephdrop folder:
>>>>>>>>> slow_requests_recovering_cuttlefish
>>>>>>>>>
>>>>>>>>> osd.52 was the one recovering.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
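To capture the logs Sam asks for, the debug levels can be raised on a running
OSD and lowered again afterwards without restarting the daemon. A minimal
sketch via injectargs, using osd.52 (the recovering OSD Stefan mentions) as
the target; the same "debug osd = 20" style lines can instead go under [osd]
in ceph.conf if a daemon restart is acceptable:

    # raise verbosity on the OSD of interest (repeat for a few OSDs,
    # including the restarted one, as Sam suggests)
    ceph tell osd.52 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1 --debug-optracker 20'

    # ... reproduce the slow requests, then collect the osd logs and ceph.log ...

    # dial the levels back down once the cluster is HEALTH_OK again
    ceph tell osd.52 injectargs '--debug-osd 0/5 --debug-filestore 0/5 --debug-ms 0/5 --debug-optracker 0/5'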