Exactly what kernel are you using?
-Sam

On Mon, Nov 2, 2015 at 6:14 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> Yeah, there's a heartbeat system, and the messenger is reliable delivery.
> -Sam
>
> On Mon, Nov 2, 2015 at 5:41 PM, yangruifeng.09209@xxxxxxx
> <yangruifeng.09209@xxxxxxx> wrote:
>> I will try my best to get the detailed log.
>> In the current version, can we ensure that messages related to peering are correctly received by peers?
>>
>> thanks
>> Ruifeng Yang
>>
>> -----Original Message-----
>> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
>> Sent: November 3, 2015 9:28
>> To: yangruifeng 09209 (RD)
>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@xxxxxxxxxx); ceph-devel@xxxxxxxxxxxxxxx
>> Subject: Re: another peering stuck caused by net problem.
>>
>> Temporary network failures should be handled correctly. The best solution is to actually fix that bug, then. Capture logging on all involved OSDs while it is hung and open a bug:
>>
>> debug osd = 20
>> debug filestore = 20
>> debug ms = 1
>> -Sam
>>
>> On Mon, Nov 2, 2015 at 5:24 PM, yangruifeng.09209@xxxxxxx <yangruifeng.09209@xxxxxxx> wrote:
>>> A problem with an unknown cause leaves a PG stuck in peering; it may be a temporary network failure or another bug.
>>> BUT it can be resolved by a *manual* 'ceph osd down <osdid>'.
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
>>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Samuel Just
>>> Sent: November 3, 2015 9:12
>>> To: yangruifeng 09209 (RD)
>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@xxxxxxxxxx);
>>> ceph-devel@xxxxxxxxxxxxxxx
>>> Subject: Re: another peering stuck caused by net problem.
>>>
>>> The problem is that peering shouldn't hang for no reason. If you are
>>> seeing peering hang for a long time, either
>>> 1) you are hitting a peering bug, which we need to track down and fix, or
>>> 2) peering actually cannot make progress.
>>>
>>> In case 1, it can be nice to have a workaround to force peering to restart and avoid the bug.
>>> However, case 2 would not be helped by restarting peering; you'd just end up in the same place. If you did it based on a timeout, you'd just increase load by a ton when in that situation. What problem are you trying to solve?
>>> -Sam
>>>
>>> On Mon, Nov 2, 2015 at 5:05 PM, yangruifeng.09209@xxxxxxx <yangruifeng.09209@xxxxxxx> wrote:
>>>> ok.
>>>>
>>>> thanks
>>>> Ruifeng Yang
>>>>
>>>> -----Original Message-----
>>>> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
>>>> Sent: November 3, 2015 9:03
>>>> To: yangruifeng 09209 (RD)
>>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@xxxxxxxxxx)
>>>> Subject: Re: another peering stuck caused by net problem.
>>>>
>>>> Would it be ok if I reply to the list as well?
>>>> -Sam
>>>>
>>>> On Mon, Nov 2, 2015 at 4:37 PM, yangruifeng.09209@xxxxxxx <yangruifeng.09209@xxxxxxx> wrote:
>>>>> The cluster may stay stuck in peering in some exceptional cases, but
>>>>> it can return to normal via a *manual* 'ceph osd down <osdid>'. This is
>>>>> not convenient in a production environment, and it goes against the concept of RADOS.
>>>>> Would it be reasonable to add a timeout mechanism to kick it, or to kick it when I/O hangs?
>>>>>
>>>>> thanks,
>>>>> Ruifeng Yang
>>>>>
>>>>> -----Original Message-----
>>>>> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
>>>>> Sent: November 3, 2015 2:21
>>>>> To: yangruifeng 09209 (RD)
>>>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@xxxxxxxxxx)
>>>>> Subject: Re: another peering stuck caused by net problem.
>>>>>
>>>>> I mean issue 'ceph osd down <osdid>' for the primary on the pg. But that only causes peering to restart. If peering stalled previously, it'll probably stall again. What are you trying to accomplish?
>>>>> -Sam
>>>>>
>>>>> On Fri, Oct 30, 2015 at 5:51 PM, yangruifeng.09209@xxxxxxx <yangruifeng.09209@xxxxxxx> wrote:
>>>>>> Do you mean restart the primary OSD? Or some other command?
>>>>>>
>>>>>> thanks
>>>>>> Ruifeng Yang
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
>>>>>> Sent: October 30, 2015 23:07
>>>>>> To: chenxiaowei 11245 (RD)
>>>>>> Cc: Sage Weil (sweil@xxxxxxxxxx); yangruifeng 09209 (RD)
>>>>>> Subject: Re: another peering stuck caused by net problem.
>>>>>>
>>>>>> How would that help? As a way to work around a possible bug? You can accomplish pretty much the same thing by setting the primary down.
>>>>>> -Sam
>>>>>>
>>>>>> On Wed, Oct 28, 2015 at 8:22 PM, Chenxiaowei <chen.xiaowei@xxxxxxx> wrote:
>>>>>>> Hi, Samuel & Sage:
>>>>>>> I am cxwshawn from H3C (part of HP). The PG peering stuck
>>>>>>> problem is serious, especially in a production environment, so we came up with two solutions:
>>>>>>> if a PG stays in the Peering state too long, we can check whether a timeout
>>>>>>> has been exceeded and force a transition from Peering to the Reset state; or we can add a command line to force a single PG from stuck Peering into the Reset state.
>>>>>>>
>>>>>>> What's your advice? Looking forward to your reply.
>>>>>>>
>>>>>>> Yours,
>>>>>>> shawn from Beijing, China.
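[Editor's note] The manual workaround discussed above (finding the PGs stuck in peering and marking their primaries down) can be scripted. A minimal sketch, assuming the JSON produced by 'ceph pg dump_stuck peering --format json' is a list of pg_stat entries with 'pgid', 'state', and 'acting' fields (the exact shape varies by Ceph release):

```python
import json

def stuck_peering_primaries(pg_stats):
    """Given parsed JSON from 'ceph pg dump_stuck peering --format json'
    (assumed: a list of dicts with 'pgid', 'state', 'acting'), return a
    {pgid: primary_osd} map for PGs whose state includes 'peering'."""
    primaries = {}
    for pg in pg_stats:
        if 'peering' in pg.get('state', '') and pg.get('acting'):
            # By convention, the first OSD in the acting set is the primary.
            primaries[pg['pgid']] = pg['acting'][0]
    return primaries

# Canned example; a real run would parse the output of
#   subprocess.check_output(['ceph', 'pg', 'dump_stuck', 'peering',
#                            '--format', 'json'])
sample = json.loads('''
[{"pgid": "2.1f", "state": "peering",      "acting": [3, 7, 12]},
 {"pgid": "2.20", "state": "active+clean", "acting": [5, 1, 9]}]
''')
print(stuck_peering_primaries(sample))  # {'2.1f': 3}
```

One could then issue 'ceph osd down <osdid>' for each reported primary, but as Sam points out, this only helps in case 1 (a peering bug); if peering genuinely cannot make progress, it will stall again.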
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
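[Editor's note] For anyone reproducing this: the debug settings Sam requests above would typically go in the [osd] section of ceph.conf on the involved nodes (section placement assumed from standard ceph.conf conventions), taking effect on daemon restart:

```ini
[osd]
    debug osd = 20        ; verbose OSD state-machine (incl. peering) logging
    debug filestore = 20  ; verbose FileStore backend logging
    debug ms = 1          ; messenger-level logging of messages sent/received
```

Equivalent settings can usually be applied at runtime without a restart via injectargs, e.g. `ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'`.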