If the Raring guest was fine, I suspect that the issue is not on the OSDs.
-Sam

On Wed, Aug 21, 2013 at 10:55 AM, Mike Dawson <mike.dawson@xxxxxxxxxxxx> wrote:
> Sam,
>
> Tried it. Injected with 'ceph tell osd.* injectargs -- --no_osd_recover_clone_overlap', then stopped one OSD for ~1 minute. Upon restart, all my Windows VMs had issues until HEALTH_OK.
>
> The recovery was taking an abnormally long time, so I reverted away from --no_osd_recover_clone_overlap after about 10 minutes to get back to HEALTH_OK.
>
> Interestingly, a Raring guest running a different video surveillance package proceeded without any issue whatsoever.
>
> Here is an image of the traffic to some of these Windows guests:
>
> http://www.gammacode.com/upload/rbd-hang-with-clone-overlap.jpg
>
> Ceph is outside of HEALTH_OK between ~12:55 and 13:10. Most of these instances rebooted due to an app error caused by the I/O hang shortly after 13:10.
>
> These Windows instances are booted as COW clones from a Glance image using Cinder. They also have a second RBD volume for bulk storage. I'm using qemu 1.5.2.
>
> Thanks,
> Mike
>
> On 8/21/2013 1:12 PM, Samuel Just wrote:
>>
>> Ah, thanks for the correction.
>> -Sam
>>
>> On Wed, Aug 21, 2013 at 9:25 AM, Yann ROBIN <yann.robin@xxxxxxxxxxxxx> wrote:
>>>
>>> It's osd recover clone overlap (see http://tracker.ceph.com/issues/5401)
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Samuel Just
>>> Sent: Wednesday, August 21, 2013 17:33
>>> To: Mike Dawson
>>> Cc: Stefan Priebe - Profihost AG; josh.durgin@xxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
>>> Subject: Re: still recovery issues with cuttlefish
>>>
>>> Have you tried setting osd_recovery_clone_overlap to false? That seemed to help with Stefan's issue.
>>> -Sam
>>>
>>> On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.dawson@xxxxxxxxxxxx> wrote:
>>>>
>>>> Sam/Josh,
>>>>
>>>> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this morning, hoping it would improve this situation, but there was no appreciable change.
>>>>
>>>> One node in our cluster fsck'ed after a reboot and got a bit behind. Our instances backed by RBD volumes were OK at that point, but once the node booted fully and the OSDs started, all Windows instances with RBD volumes experienced very choppy performance and were unable to ingest video surveillance traffic and commit it to disk. Once the cluster got back to HEALTH_OK, they resumed normal operation.
>>>>
>>>> I tried for a time with conservative recovery settings (osd max backfills = 1, osd recovery op priority = 1, and osd recovery max active = 1). No improvement for the guests. So I went to more aggressive settings to get things moving faster. That decreased the duration of the outage.
>>>>
>>>> During the entire period of recovery/backfill, the network looked fine... nowhere close to saturation. iowait on all drives looked fine as well.
>>>>
>>>> Any ideas?
>>>>
>>>> Thanks,
>>>> Mike Dawson
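For reference, a rough sketch of how the conservative recovery throttles and the clone-overlap toggle discussed above can be applied, either persistently in ceph.conf or at runtime via injectargs as Mike did. The option names are the ones used in this thread and in the tracker issue quoted above; the final revert line is an assumption (flipping the boolean back to its default of true), not a command quoted in the thread:

    # ceph.conf, [osd] section (picked up when the OSDs restart)
    [osd]
        osd max backfills = 1
        osd recovery op priority = 1
        osd recovery max active = 1

    # or inject the same throttles at runtime on every OSD
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-op-priority 1 --osd-recovery-max-active 1'

    # disable clone-overlap recovery, then re-enable it once recovery settles
    # (the second line is the assumed revert, untested)
    ceph tell osd.* injectargs -- --no_osd_recover_clone_overlap
    ceph tell osd.* injectargs -- --osd_recover_clone_overlap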
>>>> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>>>>
>>>>> the same problem still occurs. Will need to check when I've time to gather logs again.
>>>>>
>>>>> On 14.08.2013 01:11, Samuel Just wrote:
>>>>>>
>>>>>> I'm not sure, but your logs did show that you had >16 recovery ops in flight, so it's worth a try. If it doesn't help, you should collect the same set of logs and I'll look again. Also, there are a few other patches between 61.7 and current cuttlefish which may help.
>>>>>> -Sam
>>>>>>
>>>>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On 13.08.2013 at 22:43, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> I just backported a couple of patches from next to fix a bug where we weren't respecting the osd_recovery_max_active config in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the current cuttlefish branch or wait for a 61.8 release.
>>>>>>>
>>>>>>> Thanks! Are you sure that this is the issue? I don't believe it, but I'll give it a try. I already tested a branch from Sage where he fixed a race regarding max active some weeks ago. So active recovering was at most 1, but the issue didn't go away.
>>>>>>>
>>>>>>> Stefan
>>>>>>>
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> I got swamped today. I should be able to look tomorrow. Sorry!
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> Did you take a look?
>>>>>>>>>>
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>> On 11.08.2013 at 05:50, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>>> Great! I'll take a look on Monday.
>>>>>>>>>>> -Sam
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Samuel,
>>>>>>>>>>>>
>>>>>>>>>>>> On 09.08.2013 23:44, Samuel Just wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>>>>
>>>>>>>>>>>>> debug osd = 20
>>>>>>>>>>>>> debug filestore = 20
>>>>>>>>>>>>> debug ms = 1
>>>>>>>>>>>>> debug optracker = 20
>>>>>>>>>>>>>
>>>>>>>>>>>>> on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again?
>>>>>>>>>>>>
>>>>>>>>>>>> done - you'll find the logs in the cephdrop folder: slow_requests_recovering_cuttlefish
>>>>>>>>>>>>
>>>>>>>>>>>> osd.52 was the one recovering
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> Greets,
>>>>>>>>>>>> Stefan
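Likewise, a rough sketch of how the debug levels Sam asks for above might be raised on a few OSDs and then lowered again once the cluster is clean. osd.52 is just the recovering OSD Stefan names; the admin-socket path is the stock default, and the values on the last line are merely low settings to quiet things down afterwards, not necessarily the shipped defaults:

    # raise logging on the OSDs of interest, including the one being restarted
    ceph tell osd.52 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1 --debug-optracker 20'

    # confirm the injected values (and e.g. osd_recovery_max_active) actually took effect
    ceph --admin-daemon /var/run/ceph/ceph-osd.52.asok config show | grep -E 'debug_osd|debug_ms|osd_recovery_max_active'

    # ...stop the OSD, restart it, wait for the cluster to become clean, collect the logs...

    # then turn logging back down so the log files stay manageable
    ceph tell osd.52 injectargs '--debug-osd 0 --debug-filestore 1 --debug-ms 0 --debug-optracker 0'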