It's osd_recover_clone_overlap (see http://tracker.ceph.com/issues/5401).

-----Original Message-----
From: ceph-devel-owner@xxxxxxxxxxxxxxx [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Samuel Just
Sent: Wednesday, August 21, 2013 17:33
To: Mike Dawson
Cc: Stefan Priebe - Profihost AG; josh.durgin@xxxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx
Subject: Re: still recovery issues with cuttlefish

Have you tried setting osd_recovery_clone_overlap to false? That seemed
to help with Stefan's issue.
-Sam
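(For anyone wanting to try this: a minimal sketch of how the flag could
be set, assuming the cuttlefish-era option name osd_recover_clone_overlap
and the injectargs syntax of that release; verify both against your
version before using it.)

    # at runtime, across all OSDs, without a restart
    ceph tell osd.* injectargs '--osd_recover_clone_overlap false'

    # or persistently, in ceph.conf
    [osd]
        osd recover clone overlap = false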
On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.dawson@xxxxxxxxxxxx> wrote:
> Sam/Josh,
>
> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this
> morning, hoping it would improve this situation, but there was no
> appreciable change.
>
> One node in our cluster fsck'ed after a reboot and got a bit behind.
> Our instances backed by RBD volumes were OK at that point, but once
> the node booted fully and the OSDs started, all Windows instances
> with RBD volumes experienced very choppy performance and were unable
> to ingest video surveillance traffic and commit it to disk. Once the
> cluster got back to HEALTH_OK, they resumed normal operation.
>
> I tried for a time with conservative recovery settings (osd max
> backfills = 1, osd recovery op priority = 1, and osd recovery max
> active = 1). That brought no improvement for the guests, so I went
> to more aggressive settings to get things moving faster. That
> decreased the duration of the outage.
>
> During the entire period of recovery/backfill, the network looked
> fine, nowhere close to saturation. iowait on all drives looked fine
> as well.
>
> Any ideas?
>
> Thanks,
> Mike Dawson
>
>
> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>
>> The same problem still occurs. I will need to check when I have
>> time to gather logs again.
>>
>> On 14.08.2013 at 01:11, Samuel Just wrote:
>>>
>>> I'm not sure, but your logs did show that you had more than 16
>>> recovery ops in flight, so it's worth a try. If it doesn't help,
>>> collect the same set of logs and I'll look again. Also, there are
>>> a few other patches between 0.61.7 and current cuttlefish which
>>> may help.
>>> -Sam
>>>
>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG
>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>
>>>> On 13.08.2013 at 22:43, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>
>>>>> I just backported a couple of patches from next to fix a bug
>>>>> where we weren't respecting the osd_recovery_max_active config
>>>>> in some cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You
>>>>> can either try the current cuttlefish branch or wait for a
>>>>> 0.61.8 release.
>>>>
>>>> Thanks! Are you sure that this is the issue? I don't believe it
>>>> is, but I'll give it a try. I already tested a branch from Sage
>>>> where he fixed a race regarding max active some weeks ago, so
>>>> active recovery was capped at 1, but the issue didn't go away.
>>>>
>>>> Stefan
>>>>
>>>>> -Sam
>>>>>
>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just
>>>>> <sam.just@xxxxxxxxxxx> wrote:
>>>>>>
>>>>>> I got swamped today. I should be able to look tomorrow. Sorry!
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG
>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Did you take a look?
>>>>>>>
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 11.08.2013 at 05:50, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> Great! I'll take a look on Monday.
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi Samuel,
>>>>>>>>>
>>>>>>>>> On 09.08.2013 at 23:44, Samuel Just wrote:
>>>>>>>>>
>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>
>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>
>>>>>>>>>> debug osd = 20
>>>>>>>>>> debug filestore = 20
>>>>>>>>>> debug ms = 1
>>>>>>>>>> debug optracker = 20
>>>>>>>>>>
>>>>>>>>>> on a few osds (including the restarted osd), and upload
>>>>>>>>>> those osd logs along with the ceph.log from before killing
>>>>>>>>>> the osd until after the cluster becomes clean again?
>>>>>>>>>
>>>>>>>>> Done; you'll find the logs in the cephdrop folder:
>>>>>>>>> slow_requests_recovering_cuttlefish
>>>>>>>>>
>>>>>>>>> osd.52 was the one recovering
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
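(For readers tuning the same knobs: the recovery throttles Mike mentions
and the debug settings Sam asked for, collected as a ceph.conf sketch.
The values are the ones quoted in this thread, not general
recommendations, and the option names are the cuttlefish-era ones;
verify against your version.)

    [osd]
        # conservative recovery throttling (Mike's first attempt)
        osd max backfills = 1
        osd recovery op priority = 1
        osd recovery max active = 1

        # verbose logging for the recovery traces Sam requested
        debug osd = 20
        debug filestore = 20
        debug ms = 1
        debug optracker = 20

    # the same throttles can be injected at runtime, e.g.:
    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-op-priority 1 --osd-recovery-max-active 1'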