Fwd: A bug may cause a request be executed twice in mClockQueue

韦皓诚 <whc0000001@xxxxxxxxx> · Tue, 16 Oct 2018 15:47:38 +0800



>
> Thanks for apply. I am sorry for communicating in three emails. I hava
> not set allow_limit_break false. And after reading the code again， I
> think the cause of the core dump may be  that I try to set
> "osd_op_queue_mclock_scrub_wgt" to a extremely small value to restrict
> scrub. So the part about allow_limit_break have not work.
>
> if (allow_limit_break) {
>   if (readys.has_request() &&
>       readys.next_request().tag.proportion < max_tag) {
>     result.type = NextReqType::returning;
>     result.heap_id = HeapId::ready;
>     return result;
>   } else if (reserv.has_request() &&
>      reserv.next_request().tag.reservation < max_tag) {
>     result.type = NextReqType::returning;
>     result.heap_id = HeapId::reservation;
>     return result;
>   }
> }
>
> So I wonder if it means that the limit tag is not yet available?
> Thank you again !
> J. Eric Ivancich <ivancich@xxxxxxxxxx> 于2018年10月16日周二 上午2:04写道：
> >
> > This topic is now spread across three threads in ceph-devel. Can we keep
> > it in a single thread? That would help me, so I won't have to reply the
> > same way multiple times.
> >
> > In another of the threads I asked you if you'd modified the source code
> > to turn off allow_limit_break. Because 12.2.4 has allow_limit_break set
> > to true (later versions support three states and use an enum class to
> > distinguish them). And if that's the case
> > PullPriorityQueue::pull_request should not be returning a "future",
> > unless there's a bug.
> >
> > I'm happy to track down the bug, but I'd first need to know whether the
> > source code has been modified to create this situation.
> >
> > Eric
> >
> > On 10/11/18 5:47 AM, 韦皓诚 wrote:
> > > Yes, I agree.
> > > kefu chai <tchaikov@xxxxxxxxx> 于2018年10月11日周四 下午5:25写道：
> > >>
> > >> On Thu, Oct 11, 2018 at 1:19 PM 韦皓诚 <whc0000001@xxxxxxxxx> wrote:
> > >>>
> > >>> Hi guys
> > >>> Class mClockQueue calls  PullPriorityQueue::pull_request in function
> > >>> dequeue() . But PullPriorityQueue::pull_request may return a "future"
> > >>> value. That means the mclock tag of the request are greater than now,
> > >>> so it should be executed in the future instead of now.The mistake is
> > >>> that the mClockQueue::dequeue() do not consider about the type of
> > >>> return value
> > >>
> > >> in that case, the returned PullReq is a "future" instead of a "retn",
> > >> so ceph_assert() will abort OSD if a "future" is returned. but i agree
> > >> with you, probably we need to wait for "crimson::dmclock::get_time() -
> > >> pr.getTime()" before retrying.
> > >>
> > >>> and execute the request immediately. What is more serious is that
> > >>> PullPriorityQueue remains the request in queue and it would be
> > >>> executed twice in the future.
> > >>>
> > >>>
> > >>>                              From Weihaocheng
> > >>
> > >>
> > >>
> > >> --
> > >> Regards
> > >> Kefu Chai
> >