Have you tried setting osd_recovery_clone_overlap to false? That seemed to
help with Stefan's issue.
-Sam

On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.dawson@xxxxxxxxxxxx> wrote:
> Sam/Josh,
>
> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this morning,
> hoping it would improve this situation, but there was no appreciable change.
>
> One node in our cluster fsck'ed after a reboot and got a bit behind. Our
> instances backed by RBD volumes were OK at that point, but once the node
> booted fully and the OSDs started, all Windows instances with RBD volumes
> experienced very choppy performance and were unable to ingest video
> surveillance traffic and commit it to disk. Once the cluster got back to
> HEALTH_OK, they resumed normal operation.
>
> I tried for a time with conservative recovery settings (osd max backfills =
> 1, osd recovery op priority = 1, and osd recovery max active = 1). There was
> no improvement for the guests, so I went to more aggressive settings to get
> things moving faster. That decreased the duration of the outage.
>
> During the entire period of recovery/backfill, the network looked fine...
> nowhere close to saturation. iowait on all drives looked fine as well.
>
> Any ideas?
>
> Thanks,
> Mike Dawson
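For reference, a minimal sketch of how the settings discussed above could be
expressed in ceph.conf and injected at runtime on a cuttlefish/dumpling-era
cluster. The three recovery throttles are named exactly as Mike lists them;
the clone-overlap option is spelled as Sam wrote it and may appear as
"osd recover clone overlap" in your release's config reference, so verify the
exact name before relying on it.

    [osd]
        # throttle recovery/backfill impact (Mike's conservative values)
        osd max backfills = 1
        osd recovery op priority = 1
        osd recovery max active = 1
        # Sam's suggestion; check the exact option name for your release
        osd recovery clone overlap = false

The throttles can also be pushed into running OSDs without a restart:

    ceph tell osd.\* injectargs '--osd-max-backfills 1 --osd-recovery-op-priority 1 --osd-recovery-max-active 1'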
> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>
>> The same problem still occurs. I will need to check when I've time to
>> gather logs again.
>>
>> On 14.08.2013 01:11, Samuel Just wrote:
>>>
>>> I'm not sure, but your logs did show that you had >16 recovery ops in
>>> flight, so it's worth a try. If it doesn't help, collect the same set
>>> of logs and I'll look again. Also, there are a few other patches
>>> between 0.61.7 and current cuttlefish which may help.
>>> -Sam
>>>
>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG
>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>
>>>> On 13.08.2013 at 22:43, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>
>>>>> I just backported a couple of patches from next to fix a bug where we
>>>>> weren't respecting the osd_recovery_max_active config in some cases
>>>>> (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e). You can either try the
>>>>> current cuttlefish branch or wait for a 0.61.8 release.
>>>>
>>>> Thanks! Are you sure that this is the issue? I don't believe it is, but
>>>> I'll give it a try. I already tested a branch from Sage where he fixed
>>>> a race regarding max active some weeks ago, so active recovery was
>>>> capped at 1, but the issue didn't go away.
>>>>
>>>> Stefan
>>>>
>>>>> -Sam
>>>>>
>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just <sam.just@xxxxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> I got swamped today. I should be able to look tomorrow. Sorry!
>>>>>> -Sam
>>>>>>
>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG
>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Did you take a look?
>>>>>>>
>>>>>>> Stefan
>>>>>>>
>>>>>>> On 11.08.2013 at 05:50, Samuel Just <sam.just@xxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>>> Great! I'll take a look on Monday.
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe
>>>>>>>> <s.priebe@xxxxxxxxxxxx> wrote:
>>>>>>>>>
>>>>>>>>> Hi Samuel,
>>>>>>>>>
>>>>>>>>> On 09.08.2013 23:44, Samuel Just wrote:
>>>>>>>>>
>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>
>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>
>>>>>>>>>> debug osd = 20
>>>>>>>>>> debug filestore = 20
>>>>>>>>>> debug ms = 1
>>>>>>>>>> debug optracker = 20
>>>>>>>>>>
>>>>>>>>>> on a few osds (including the restarted osd), and upload those osd
>>>>>>>>>> logs along with the ceph.log from before killing the osd until
>>>>>>>>>> after the cluster becomes clean again?
>>>>>>>>>
>>>>>>>>> Done - you'll find the logs in the cephdrop folder:
>>>>>>>>> slow_requests_recovering_cuttlefish
>>>>>>>>>
>>>>>>>>> osd.52 was the one recovering.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> Greets,
>>>>>>>>> Stefan
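To capture the logs Sam asks for, the debug levels can be raised on a running
OSD and lowered again afterwards without restarting the daemon. A minimal
sketch via injectargs, using osd.52 (the recovering OSD Stefan mentions) as
the target; the same "debug osd = 20" style lines can instead go under [osd]
in ceph.conf if a daemon restart is acceptable:

    # raise verbosity on the OSD of interest (repeat for a few OSDs,
    # including the restarted one, as Sam suggests)
    ceph tell osd.52 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1 --debug-optracker 20'

    # ... reproduce the slow requests, then collect the osd logs and ceph.log ...

    # dial the levels back down once the cluster is HEALTH_OK again
    ceph tell osd.52 injectargs '--debug-osd 0/5 --debug-filestore 0/5 --debug-ms 0/5 --debug-optracker 0/5'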