Exactly what kernel are you using?
-Sam

On Mon, Nov 2, 2015 at 6:14 PM, Samuel Just <sjust@xxxxxxxxxx> wrote:
> Yeah, there's a heartbeat system, and the messenger is reliable delivery.
> -Sam
>
> On Mon, Nov 2, 2015 at 5:41 PM, yangruifeng.09209@xxxxxxx
> <yangruifeng.09209@xxxxxxx> wrote:
>> I will try my best to get the detailed log.
>> In the current version, can we ensure that messages related to peering are correctly received by peers?
>>
>> thanks
>> Ruifeng Yang
>>
>> -----Original Message-----
>> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
>> Sent: November 3, 2015 9:28
>> To: yangruifeng 09209 (RD)
>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@xxxxxxxxxx); ceph-devel@xxxxxxxxxxxxxxx
>> Subject: Re: another peering stuck caused by net problem.
>>
>> Temporary network failures should be handled correctly. The best solution is to actually fix that bug, then. Capture logging on all involved OSDs while it is hung and open a bug:
>>
>> debug osd = 20
>> debug filestore = 20
>> debug ms = 1
>> -Sam
>>
>> On Mon, Nov 2, 2015 at 5:24 PM, yangruifeng.09209@xxxxxxx <yangruifeng.09209@xxxxxxx> wrote:
>>> A problem with an unknown cause leaves a PG stuck in peering; it may be a temporary network failure or another bug.
>>> BUT it can be resolved by a *manual* 'ceph osd down <osdid>'.
>>>
>>> -----Original Message-----
>>> From: ceph-devel-owner@xxxxxxxxxxxxxxx
>>> [mailto:ceph-devel-owner@xxxxxxxxxxxxxxx] On Behalf Of Samuel Just
>>> Sent: November 3, 2015 9:12
>>> To: yangruifeng 09209 (RD)
>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@xxxxxxxxxx);
>>> ceph-devel@xxxxxxxxxxxxxxx
>>> Subject: Re: another peering stuck caused by net problem.
>>>
>>> The problem is that peering shouldn't hang for no reason. If you are
>>> seeing peering hang for a long time, either
>>> 1) you are hitting a peering bug, which we need to track down and fix, or
>>> 2) peering actually cannot make progress.
>>>
>>> In case 1, it can be nice to have a workaround to force peering to restart and avoid the bug.
>>> However, case 2 would not be helped by restarting peering; you'd just end up in the same place. If you did it based on a timeout, you'd just increase load by a ton when in that situation. What problem are you trying to solve?
>>> -Sam
>>>
>>> On Mon, Nov 2, 2015 at 5:05 PM, yangruifeng.09209@xxxxxxx <yangruifeng.09209@xxxxxxx> wrote:
>>>> ok.
>>>>
>>>> thanks
>>>> Ruifeng Yang
>>>>
>>>> -----Original Message-----
>>>> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
>>>> Sent: November 3, 2015 9:03
>>>> To: yangruifeng 09209 (RD)
>>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@xxxxxxxxxx)
>>>> Subject: Re: another peering stuck caused by net problem.
>>>>
>>>> Would it be ok if I reply to the list as well?
>>>> -Sam
>>>>
>>>> On Mon, Nov 2, 2015 at 4:37 PM, yangruifeng.09209@xxxxxxx <yangruifeng.09209@xxxxxxx> wrote:
>>>>> The cluster may stay stuck in peering in some exceptional cases, but
>>>>> it can return to normal via a *manual* 'ceph osd down <osdid>'. This is
>>>>> not convenient in a production environment, and it goes against the concept of RADOS.
>>>>> Would it be reasonable to add a timeout mechanism to kick it, or to kick it when I/O hangs?
>>>>>
>>>>> thanks,
>>>>> Ruifeng Yang
>>>>>
>>>>> -----Original Message-----
>>>>> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
>>>>> Sent: November 3, 2015 2:21
>>>>> To: yangruifeng 09209 (RD)
>>>>> Cc: chenxiaowei 11245 (RD); Sage Weil (sweil@xxxxxxxxxx)
>>>>> Subject: Re: another peering stuck caused by net problem.
>>>>>
>>>>> I mean issue 'ceph osd down <osdid>' for the primary on the pg. But that only causes peering to restart. If peering stalled previously, it'll probably stall again. What are you trying to accomplish?
>>>>> -Sam
>>>>>
>>>>> On Fri, Oct 30, 2015 at 5:51 PM, yangruifeng.09209@xxxxxxx <yangruifeng.09209@xxxxxxx> wrote:
>>>>>> Do you mean restart the primary OSD? Or some other command?
>>>>>>
>>>>>> thanks
>>>>>> Ruifeng Yang
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
>>>>>> Sent: October 30, 2015 23:07
>>>>>> To: chenxiaowei 11245 (RD)
>>>>>> Cc: Sage Weil (sweil@xxxxxxxxxx); yangruifeng 09209 (RD)
>>>>>> Subject: Re: another peering stuck caused by net problem.
>>>>>>
>>>>>> How would that help? As a way to work around a possible bug? You can accomplish pretty much the same thing by setting the primary down.
>>>>>> -Sam
>>>>>>
>>>>>> On Wed, Oct 28, 2015 at 8:22 PM, Chenxiaowei <chen.xiaowei@xxxxxxx> wrote:
>>>>>>> Hi, Samuel & Sage:
>>>>>>> I am cxwshawn from H3C (part of HP). The PG peering stuck
>>>>>>> problem is serious, especially in a production environment, so we came up with two solutions:
>>>>>>> if a PG stays in the Peering state too long, we can check whether a timeout
>>>>>>> has been exceeded and force a transition from Peering to the Reset state; or we can add a command line to force a single PG from stuck Peering into the Reset state.
>>>>>>>
>>>>>>> What's your advice? Looking forward to your reply.
>>>>>>>
>>>>>>> Yours,
>>>>>>> shawn from Beijing, China.
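[Editor's note] The manual workaround discussed above (finding the PGs stuck in peering and marking their primaries down) can be scripted. A minimal sketch, assuming the JSON produced by 'ceph pg dump_stuck peering --format json' is a list of pg_stat entries with 'pgid', 'state', and 'acting' fields (the exact shape varies by Ceph release):

```python
import json

def stuck_peering_primaries(pg_stats):
    """Given parsed JSON from 'ceph pg dump_stuck peering --format json'
    (assumed: a list of dicts with 'pgid', 'state', 'acting'), return a
    {pgid: primary_osd} map for PGs whose state includes 'peering'."""
    primaries = {}
    for pg in pg_stats:
        if 'peering' in pg.get('state', '') and pg.get('acting'):
            # By convention, the first OSD in the acting set is the primary.
            primaries[pg['pgid']] = pg['acting'][0]
    return primaries

# Canned example; a real run would parse the output of
#   subprocess.check_output(['ceph', 'pg', 'dump_stuck', 'peering',
#                            '--format', 'json'])
sample = json.loads('''
[{"pgid": "2.1f", "state": "peering",      "acting": [3, 7, 12]},
 {"pgid": "2.20", "state": "active+clean", "acting": [5, 1, 9]}]
''')
print(stuck_peering_primaries(sample))  # {'2.1f': 3}
```

One could then issue 'ceph osd down <osdid>' for each reported primary, but as Sam points out, this only helps in case 1 (a peering bug); if peering genuinely cannot make progress, it will stall again.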
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
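[Editor's note] For anyone reproducing this: the debug settings Sam requests above would typically go in the [osd] section of ceph.conf on the involved nodes (section placement assumed from standard ceph.conf conventions), taking effect on daemon restart:

```ini
[osd]
    debug osd = 20        ; verbose OSD state-machine (incl. peering) logging
    debug filestore = 20  ; verbose FileStore backend logging
    debug ms = 1          ; messenger-level logging of messages sent/received
```

Equivalent settings can usually be applied at runtime without a restart via injectargs, e.g. `ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1'`.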