Ah, thanks for the correction.
-Sam

On Wed, Aug 21, 2013 at 9:25 AM, Yann ROBIN <yann.robin@xxxxxxxxxxxxx> wrote:
> It's osd recover clone overlap (see http://tracker.ceph.com/issues/5401)
>
> -----Original Message-----
> From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Samuel Just
> Sent: mercredi 21 août 2013 17:33
> To: Mike Dawson
> Cc: Stefan Priebe - Profihost AG; josh.durgin@xxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
> Subject: Re: still recovery issues with cuttlefish
>
> Have you tried setting osd_recovery_clone_overlap to false? That seemed to help with Stefan's issue.
> -Sam
>
> On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.dawson@xxxxxxxxxxxx> wrote:
>> Sam/Josh,
>>
>> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this morning, hoping it would improve this situation, but there was no appreciable change.
>>
>> One node in our cluster fsck'ed after a reboot and got a bit behind. Our instances backed by RBD volumes were OK at that point, but once the node booted fully and the OSDs started, all Windows instances with rbd volumes experienced very choppy performance and were unable to ingest video surveillance traffic and commit it to disk. Once the cluster got back to HEALTH_OK, they resumed normal operation.
>>
>> I tried for a time with conservative recovery settings (osd max backfills = 1, osd recovery op priority = 1, and osd recovery max active = 1). No improvement for the guests. So I went to more aggressive settings to get things moving faster. That decreased the duration of the outage.
>>
>> During the entire period of recovery/backfill, the network looked fine... nowhere close to saturation. iowait on all drives looked fine as well.
>>
>> Any ideas?
>>
>> Thanks,
>> Mike Dawson
>>
>> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>>
>>> The same problem still occurs. I will need to check when I have time to gather logs again.
>>>
>>> Am 14.08.2013 01:11, schrieb Samuel Just:
>>>>
>>>> I'm not sure, but your logs did show that you had >16 recovery ops in flight, so it's worth a try. If it doesn't help, collect the same set of logs and I'll look again. Also, there are a few other patches between 61.7 and current cuttlefish which may help.
>>>> -Sam
>>>>
>>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>
>>>>> Am 13.08.2013 um 22:43 schrieb Samuel Just <sam.just@xxxxxxxxxxx>:
>>>>>
>>>>>> I just backported a couple of patches from next to fix a bug where we weren't respecting the osd_recovery_max_active config in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the current cuttlefish branch or wait for a 61.8 release.
>>>>>
>>>>> Thanks! Are you sure that this is the issue? I don't believe it, but I'll give it a try. I already tested a branch from Sage where he fixed a race regarding max active some weeks ago, so active recovery was capped at 1, but the issue didn't go away.
>>>>>
>>>>> Stefan
>>>>>
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> I got swamped today. I should be able to look tomorrow. Sorry!
>>>>>>> -Sam
>>>>>>>
>>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>> Did you take a look?
>>>>>>>>
>>>>>>>> Stefan
>>>>>>>>
>>>>>>>> Am 11.08.2013 um 05:50 schrieb Samuel Just <sam.just@xxxxxxxxxxx>:
>>>>>>>>
>>>>>>>>> Great! I'll take a look on Monday.
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Samuel,
>>>>>>>>>>
>>>>>>>>>> Am 09.08.2013 23:44, schrieb Samuel Just:
>>>>>>>>>>
>>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>>
>>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>>
>>>>>>>>>>> debug osd = 20
>>>>>>>>>>> debug filestore = 20
>>>>>>>>>>> debug ms = 1
>>>>>>>>>>> debug optracker = 20
>>>>>>>>>>>
>>>>>>>>>>> on a few osds (including the restarted osd), and upload those osd logs along with the ceph.log from before killing the osd until after the cluster becomes clean again?
>>>>>>>>>>
>>>>>>>>>> Done - you'll find the logs in the cephdrop folder: slow_requests_recovering_cuttlefish
>>>>>>>>>>
>>>>>>>>>> osd.52 was the one recovering.
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>>
>>>>>>>>>> Greets,
>>>>>>>>>> Stefan
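For reference, the recovery-throttling options discussed above (Mike's conservative settings, plus the clone-overlap option Sam and Yann mention) live in the [osd] section of ceph.conf. A minimal sketch, assuming a cuttlefish/dumpling-era cluster; the option names are taken from the thread, so verify them against the release in use:

    [osd]
        # keep recovery/backfill from starving client I/O
        osd max backfills = 1
        osd recovery max active = 1
        osd recovery op priority = 1
        # option named by Yann above; see http://tracker.ceph.com/issues/5401
        osd recover clone overlap = false

The same values can usually be injected into running OSDs without a restart, e.g. ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1' (dashes and underscores in option names are interchangeable).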
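And a sketch of enabling the debug logging Sam asks for on a few OSDs at runtime (osd.52 is the recovering OSD from Stefan's report; the injectargs invocation and the default log paths are assumptions on the editor's part, not something spelled out in the thread):

    # raise logging on the OSDs under examination, e.g. osd.52
    ceph tell osd.52 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1 --debug-optracker 20'

    # reproduce the recovery problem, then collect (default locations):
    #   /var/log/ceph/ceph-osd.52.log      per-OSD log
    #   /var/log/ceph/ceph.log             cluster log, on the monitor host

    # drop back to normal verbosity afterwards
    ceph tell osd.52 injectargs '--debug-osd 0 --debug-filestore 0 --debug-ms 0 --debug-optracker 0'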