Re: still recovery issues with cuttlefish

On 22.08.2013 05:34, Samuel Just wrote:
> It's not really possible at this time to control that limit because
> changing the primary is actually fairly expensive and doing it
> unnecessarily would probably make the situation much worse

I'm sorry, but remapping or backfilling is far less expensive on all of
my machines than recovering.

While backfilling I see around 8-10% I/O wait, while under recovery I
see 40-50%.


> (it's
> mostly necessary for backfilling, which is expensive anyway).  It
> seems like forwarding IO on an object which needs to be recovered to a
> replica with the object would be the next step.  Certainly something
> to consider for the future.

Yes, this would be the solution.
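
The forwarding idea can be sketched roughly as follows. This is only an illustrative Python model, not Ceph code: the `Osd` class, the `route_request` function, the per-object `missing` set, the object names, and the replica names osd.7 and osd.9 are all assumptions made up for the example. It puts the behaviour described in this thread (a request waits only when the requested object still needs recovery on the primary) next to the proposed one (forward that request to a replica that already has the object).

```python
from dataclasses import dataclass, field


@dataclass
class Osd:
    """Hypothetical stand-in for an OSD; `missing` holds objects awaiting recovery."""
    name: str
    missing: set = field(default_factory=set)

    def has_current(self, obj: str) -> bool:
        # An object is usable here if it is not waiting to be recovered.
        return obj not in self.missing


def route_request(obj, primary, replicas, forward=False):
    """Return (action, osd_name) for a client request on `obj`."""
    # Object is up to date on the primary: serve without waiting.
    if primary.has_current(obj):
        return ("serve", primary.name)
    # Proposed behaviour: hand the request to a replica that has the object.
    if forward:
        for replica in replicas:
            if replica.has_current(obj):
                return ("forward", replica.name)
    # Otherwise the request blocks until the primary has recovered the object.
    return ("wait_for_recovery", primary.name)


primary = Osd("osd.52", missing={"rbd_obj_a"})
replicas = [Osd("osd.7"), Osd("osd.9")]
print(route_request("rbd_obj_b", primary, replicas))
print(route_request("rbd_obj_a", primary, replicas))
print(route_request("rbd_obj_a", primary, replicas, forward=True))
```

With forwarding enabled, the request for the missing object no longer has to wait for the primary; a real implementation would also have to keep writes ordered and consistent across replicas, which this sketch ignores.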

Stefan

> -Sam
> 
> On Wed, Aug 21, 2013 at 12:37 PM, Stefan Priebe <s.priebe@xxxxxxxxxxxx> wrote:
>> Hi Sam,
>> On 21.08.2013 21:13, Samuel Just wrote:
>>
>>> As long as the request is for an object which is up to date on the
>>> primary, the request will be served without waiting for recovery.
>>
>>
>> Sure, but remember that with a random 4K VM workload a lot of objects go
>> out of date pretty quickly.
>>
>>
>>> A request only waits on recovery if the particular object being read or
>>> written must be recovered.
>>
>>
>> Yes, but under a 4K load this can be a lot.
>>
>>
>>> Your issue was that recovering the
>>> particular object being requested was unreasonably slow due to
>>> silliness in the recovery code which you disabled by disabling
>>> osd_recover_clone_overlap.
>>
>>
>> Yes and no. It's better now, but far from being good or perfect. My VMs
>> do not crash anymore, but I still have a bunch of slow requests (just around
>> 10 messages) and still a VERY high I/O load on the disks during recovery.
>>
>>
>>> In cases where the primary osd is significantly behind, we do make one
>>> of the other osds primary during recovery in order to expedite
>>> requests (pgs in this state are shown as remapped).
>>
>>
>> Oh, I've never seen that, but at least in my case even 60s is a very long
>> time and the OSD is very stressed during recovery. Is it possible for
>> me to set this value?
>>
>>
>> Stefan
>>
>>> -Sam
>>>
>>> On Wed, Aug 21, 2013 at 11:21 AM, Stefan Priebe <s.priebe@xxxxxxxxxxxx>
>>> wrote:
>>>>
>>>> On 21.08.2013 17:32, Samuel Just wrote:
>>>>
>>>>> Have you tried setting osd_recover_clone_overlap to false?  That
>>>>> seemed to help with Stefan's issue.
>>>>
>>>>
>>>>
>>>> This might sound a bit harsh, but maybe that is due to my limited English
>>>> skills ;-)
>>>>
>>>> I still think that Ceph's recovery system is broken by design. If an OSD
>>>> comes back (after being offline), all write requests for PGs where it is
>>>> primary are immediately targeted at this OSD. If it is not up to date for
>>>> a PG, it tries to recover that one immediately, which costs 4MB / block.
>>>> If you have a lot of small writes all over your OSDs and PGs, you're stuck,
>>>> as your OSD has to recover ALL its PGs immediately, or at least lots of
>>>> them, WHICH can't work. This is totally crazy.
>>>>
>>>> I think the right way would be:
>>>> 1.) if an OSD goes down, the replicas become primaries
>>>>
>>>> or
>>>>
>>>> 2.) an OSD which does not have an up-to-date PG should redirect to the OSD
>>>> holding the second or third replica.
>>>>
>>>> Both would allow a really smooth and slow recovery without any stress,
>>>> even under heavy 4K workloads like RBD-backed VMs.
>>>>
>>>> Thanks for reading!
>>>>
>>>> Greets Stefan
>>>>
>>>>
>>>>
>>>>> -Sam
>>>>>
>>>>> On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.dawson@xxxxxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> Sam/Josh,
>>>>>>
>>>>>> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this
>>>>>> morning, hoping it would improve this situation, but there was no
>>>>>> appreciable change.
>>>>>>
>>>>>> One node in our cluster fsck'ed after a reboot and got a bit behind.
>>>>>> Our instances backed by RBD volumes were OK at that point, but once
>>>>>> the node booted fully and the OSDs started, all Windows instances with
>>>>>> RBD volumes experienced very choppy performance and were unable to
>>>>>> ingest video surveillance traffic and commit it to disk. Once the
>>>>>> cluster got back to HEALTH_OK, they resumed normal operation.
>>>>>>
>>>>>> I tried for a time with conservative recovery settings (osd max
>>>>>> backfills = 1, osd recovery op priority = 1, and osd recovery max
>>>>>> active = 1). No improvement for the guests. So I went to more
>>>>>> aggressive settings to get things moving faster. That decreased the
>>>>>> duration of the outage.
>>>>>>
>>>>>> During the entire period of recovery/backfill, the network looked
>>>>>> fine, nowhere close to saturation. iowait on all drives looked fine as
>>>>>> well.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Thanks,
>>>>>> Mike Dawson
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> The same problem still occurs. I'll need to check when I have time to
>>>>>>> gather logs again.
>>>>>>>
>>>>>>> On 14.08.2013 01:11, Samuel Just wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I'm not sure, but your logs did show that you had >16 recovery ops in
>>>>>>>> flight, so it's worth a try.  If it doesn't help, you should collect
>>>>>>>> the same set of logs and I'll look again.  Also, there are a few other
>>>>>>>> patches between 61.7 and current cuttlefish which may help.
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 13.08.2013 at 22:43, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>>> I just backported a couple of patches from next to fix a bug where
>>>>>>>>>> we weren't respecting the osd_recovery_max_active config in some
>>>>>>>>>> cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e).  You can either
>>>>>>>>>> try the current cuttlefish branch or wait for a 61.8 release.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks! Are you sure that this is the issue? I don't believe so, but
>>>>>>>>> I'll give it a try. I already tested a branch from Sage where he
>>>>>>>>> fixed a race regarding max active some weeks ago. So active
>>>>>>>>> recovery was at most 1, but the issue didn't go away.
>>>>>>>>>
>>>>>>>>> Stefan
>>>>>>>>>
>>>>>>>>>> -Sam
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just
>>>>>>>>>> <sam.just@xxxxxxxxxxx>
>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I got swamped today.  I should be able to look tomorrow.  Sorry!
>>>>>>>>>>> -Sam
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG
>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Did you take a look?
>>>>>>>>>>>>
>>>>>>>>>>>> Stefan
>>>>>>>>>>>>
>>>>>>>>>>>> On 11.08.2013 at 05:50, Samuel Just
>>>>>>>>>>>> <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Great!  I'll take a look on Monday.
>>>>>>>>>>>>> -Sam
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe
>>>>>>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi Samuel,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 09.08.2013 23:44, Samuel Just wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> debug osd = 20
>>>>>>>>>>>>>>> debug filestore = 20
>>>>>>>>>>>>>>> debug ms = 1
>>>>>>>>>>>>>>> debug optracker = 20
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> on a few osds (including the restarted osd), and upload those
>>>>>>>>>>>>>>> osd logs along with the ceph.log from before killing the osd
>>>>>>>>>>>>>>> until after the cluster becomes clean again?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Done - you'll find the logs in the cephdrop folder:
>>>>>>>>>>>>>> slow_requests_recovering_cuttlefish
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> osd.52 was the one recovering
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Greets,
>>>>>>>>>>>>>> Stefan
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe
>>>>>>>>>>>>> ceph-devel" in
>>>>>>>>>>>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>>>>>>>>>>>> More majordomo info at
>>>>>>>>>>>>> http://vger.kernel.org/majordomo-info.html



