On 22.08.2013 05:34, Samuel Just wrote:
> It's not really possible at this time to control that limit because
> changing the primary is actually fairly expensive and doing it
> unnecessarily would probably make the situation much worse

I'm sorry, but remapping or backfilling is far less expensive on all of my
machines than recovering. While backfilling I see around 8-10% I/O wait,
while under recovery I see 40-50%.

> (it's mostly necessary for backfilling, which is expensive anyway). It
> seems like forwarding IO on an object which needs to be recovered to a
> replica with the object would be the next step. Certainly something
> to consider for the future.

Yes, this would be the solution.

Stefan

> -Sam
>
> On Wed, Aug 21, 2013 at 12:37 PM, Stefan Priebe <s.priebe@xxxxxxxxxxxx> wrote:
>> Hi Sam,
>>
>> On 21.08.2013 21:13, Samuel Just wrote:
>>
>>> As long as the request is for an object which is up to date on the
>>> primary, the request will be served without waiting for recovery.
>>
>> Sure, but remember that with a random 4K VM workload a lot of objects go
>> out of date pretty soon.
>>
>>> A request only waits on recovery if the particular object being read or
>>> written must be recovered.
>>
>> Yes, but under 4K load that can be a lot of objects.
>>
>>> Your issue was that recovering the
>>> particular object being requested was unreasonably slow due to
>>> silliness in the recovery code which you disabled by disabling
>>> osd_recover_clone_overlap.
>>
>> Yes and no. It's better now, but far from being good or perfect. My VMs
>> do not crash anymore, but I still get a bunch of slow requests (around
>> 10 messages) and still a VERY high I/O load on the disks during recovery.
>>
>>> In cases where the primary osd is significantly behind, we do make one
>>> of the other osds primary during recovery in order to expedite
>>> requests (pgs in this state are shown as remapped).
>>
>> Oh, I have never seen that, but at least in my case even 60s is a very
>> long timeframe and the OSD is very stressed during recovery. Is it
>> possible for me to set this value?
>>
>> Stefan
>>
>>> -Sam
>>>
>>> On Wed, Aug 21, 2013 at 11:21 AM, Stefan Priebe <s.priebe@xxxxxxxxxxxx> wrote:
>>>>
>>>> On 21.08.2013 17:32, Samuel Just wrote:
>>>>
>>>>> Have you tried setting osd_recover_clone_overlap to false? That
>>>>> seemed to help with Stefan's issue.
>>>>
>>>> This might sound a bit harsh, but maybe that is due to my limited
>>>> English skills ;-)
>>>>
>>>> I still think that Ceph's recovery system is broken by design. If an OSD
>>>> comes back after being offline, all write requests for PGs where this
>>>> OSD is primary are targeted at it immediately. If it is not up to date
>>>> for a PG, it tries to recover it immediately, which costs 4MB per block.
>>>> If you have a lot of small writes all over your OSDs and PGs, you're
>>>> stuck, as your OSD has to recover ALL of its PGs immediately, or at
>>>> least a lot of them, which can't work. This is totally crazy.
>>>>
>>>> I think the right way would be:
>>>> 1.) if an OSD goes down, the replicas become primaries
>>>>
>>>> or
>>>>
>>>> 2.) an OSD which does not have an up-to-date PG should redirect to the
>>>> OSD holding the second or third replica.
>>>>
>>>> Both result in being able to have a really smooth and slow recovery
>>>> without any stress, even under heavy 4K workloads like RBD-backed VMs.
>>>>
>>>> Thanks for reading!
>>>>
>>>> Greets Stefan
>>>>
>>>>> -Sam
>>>>>
>>>>> On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.dawson@xxxxxxxxxxxx> wrote:
>>>>>>
>>>>>> Sam/Josh,
>>>>>>
>>>>>> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this
>>>>>> morning, hoping it would improve this situation, but there was no
>>>>>> appreciable change.
>>>>>>
>>>>>> One node in our cluster fsck'ed after a reboot and got a bit behind.
>>>>>> Our instances backed by RBD volumes were OK at that point, but once
>>>>>> the node booted fully and the OSDs started, all Windows instances with
>>>>>> rbd volumes experienced very choppy performance and were unable to
>>>>>> ingest video surveillance traffic and commit it to disk. Once the
>>>>>> cluster got back to HEALTH_OK, they resumed normal operation.
>>>>>>
>>>>>> I tried for a time with conservative recovery settings (osd max
>>>>>> backfills = 1, osd recovery op priority = 1, and osd recovery max
>>>>>> active = 1). No improvement for the guests, so I went to more
>>>>>> aggressive settings to get things moving faster. That decreased the
>>>>>> duration of the outage.
>>>>>>
>>>>>> During the entire period of recovery/backfill the network looked
>>>>>> fine, nowhere close to saturation. iowait on all drives looked fine
>>>>>> as well.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Thanks,
>>>>>> Mike Dawson
>>>>>>
>>>>>> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>>>>>>
>>>>>>> The same problem still occurs. I will need to check when I have time
>>>>>>> to gather logs again.
>>>>>>>
>>>>>>> On 14.08.2013 01:11, Samuel Just wrote:
>>>>>>>>
>>>>>>>> I'm not sure, but your logs did show that you had >16 recovery ops
>>>>>>>> in flight, so it's worth a try. If it doesn't help, collect the same
>>>>>>>> set of logs and I'll look again. Also, there are a few other patches
>>>>>>>> between 61.7 and current cuttlefish which may help.
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> On 13.08.2013 at 22:43, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>> I just backported a couple of patches from next to fix a bug where
>>>>>>>>>> we weren't respecting the osd_recovery_max_active config in some
>>>>>>>>>> cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either
>>>>>>>>>> try the current cuttlefish branch or wait for a 61.8 release.
>>>>>>>>>
>>>>>>>>> Thanks! Are you sure that this is the issue? I don't believe it,
>>>>>>>>> but I'll give it a try. I already tested a branch from Sage where
>>>>>>>>> he fixed a race regarding max active some weeks ago, so active
>>>>>>>>> recovery was max 1, but the issue didn't go away.
>>>>>>>>>
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just
>>>>>>>>>> <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I got swamped today. I should be able to look tomorrow. Sorry!
>>>>>>>>>>> -Sam
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG
>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Did you take a look?
>>>>>>>>>>>>
>>>>>>>>>>>> Stefan
>>>>>>>>>>>>
>>>>>>>>>>>> On 11.08.2013 at 05:50, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Great! I'll take a look on Monday.
>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe
>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Samuel,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 09.08.2013 23:44, Samuel Just wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> debug osd = 20
>>>>>>>>>>>>>>> debug filestore = 20
>>>>>>>>>>>>>>> debug ms = 1
>>>>>>>>>>>>>>> debug optracker = 20
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> on a few osds (including the restarted osd), and upload those
>>>>>>>>>>>>>>> osd logs along with the ceph.log from before killing the osd
>>>>>>>>>>>>>>> until after the cluster becomes clean again?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Done - you'll find the logs in the cephdrop folder:
>>>>>>>>>>>>>> slow_requests_recovering_cuttlefish
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> osd.52 was the one recovering.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>> Stefan
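
For reference, the recovery-related settings mentioned in this thread,
collected into a ceph.conf-style sketch. The throttle values are the
conservative ones Mike tried, osd recover clone overlap = false is the
workaround that helped Stefan, and the injectargs line at the end is an
assumption whose exact flag spelling may differ between releases:

    [osd]
        ; throttle recovery/backfill so client I/O keeps some headroom
        osd max backfills = 1
        osd recovery max active = 1
        osd recovery op priority = 1
        ; workaround discussed in this thread for slow clone recovery
        osd recover clone overlap = false

        ; debug levels Sam asked for when gathering recovery logs
        ; debug osd = 20
        ; debug filestore = 20
        ; debug ms = 1
        ; debug optracker = 20

    # Assumed runtime equivalent (no restart); verify the syntax on your release:
    # ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'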