Re: Upstream smoke test failures

On 22 November 2016 at 13:09, Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx> wrote:


----- Original Message -----
> From: "Vijay Bellur" <vbellur@xxxxxxxxxx>
> To: "Nithya Balachandran" <nbalacha@xxxxxxxxxx>
> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
> Sent: Wednesday, November 16, 2016 9:41:12 AM
> Subject: Re: Upstream smoke test failures
>
> On Tue, Nov 15, 2016 at 8:40 AM, Nithya Balachandran
> <nbalacha@xxxxxxxxxx> wrote:
> >
> >
> > On 15 November 2016 at 18:55, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:
> >>
> >> On Mon, Nov 14, 2016 at 10:34 PM, Nithya Balachandran
> >> <nbalacha@xxxxxxxxxx> wrote:
> >> >
> >> >
> >> > On 14 November 2016 at 21:38, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:
> >> >>
> >> >> I would prefer that we disable dbench only if we have an owner for
> >> >> fixing the problem and re-enabling it as part of smoke tests. Running
> >> >> dbench seamlessly on gluster has worked for a long while and if it is
> >> >> failing today, we need to address this regression asap.
> >> >>
> >> >> Does anybody have more context or clues on why dbench is failing now?
> >> >>
> >> > While I agree that it needs to be looked at asap, leaving it in until we
> >> > get an owner seems rather pointless as all it does is hold up various
> >> > patches and waste machine time. Re-triggering it multiple times so that it
> >> > eventually passes does not add anything to the regression test processes
> >> > or validate the patch as we know there is a problem.
> >> >
> >> > I would vote for removing it and assigning someone to look at it
> >> > immediately.
> >> >
> >>
> >> From the debugging done so far can we identify an owner to whom this
> >> can be assigned? I looked around for related discussions and could
> >> figure out that we are looking to get statedumps. Do we have more
> >> information/context beyond this?
> >>
> > I have updated the BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1379228)
> > with info from the last failure - looks like hangs in write-behind and
> > read-ahead.
> >
>
>
> I spent some time on this today and it does look like write-behind is
> absorbing READs without performing any WIND/UNWIND actions. I have
> attached a statedump from a slave that had the dbench problem (thanks,
> Nigel!) to the above bug.
>
> Snip from statedump:
>
> [global.callpool.stack.2]
> stack=0x7fd970002cdc
> uid=0
> gid=0
> pid=31884
> unique=37870
> lk-owner=0000000000000000
> op=READ
> type=1
> cnt=2
>
> [global.callpool.stack.2.frame.1]
> frame=0x7fd9700036ac
> ref_count=0
> translator=patchy-read-ahead
> complete=0
> parent=patchy-readdir-ahead
> wind_from=ra_page_fault
> wind_to=FIRST_CHILD (fault_frame->this)->fops->readv
> unwind_to=ra_fault_cbk
>
> [global.callpool.stack.2.frame.2]
> frame=0x7fd97000346c
> ref_count=1
> translator=patchy-readdir-ahead
> complete=0
>
>
> Note that the frame which was wound from ra_page_fault() to
> write-behind is not yet complete and write-behind has not progressed
> the call. There are several callstacks with a similar signature in
> the statedump.

I think the culprit here is read-ahead, not write-behind. If the read fop had been dropped in write-behind, we should have seen a frame associated with write-behind (complete=0 for a frame associated with an xlator indicates that the frame was not unwound from _that_ xlator), but I didn't see any. The empty request queues in wb_inode also corroborate this hypothesis.

We have seen both. See comment#17 in https://bugzilla.redhat.com/show_bug.cgi?id=1379228.

regards,
Nithya

Karthick Subrahmanya is working on a similar issue reported by a user. However, we have not made much progress so far.
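
The complete=0 reasoning above can be checked mechanically across a whole statedump rather than by eye. The following is a minimal sketch, assuming the callpool sections have the same "[section]" / "key=value" layout as the snip quoted in this thread; the function names (parse_sections, incomplete_frames) are hypothetical helpers written for illustration and are not part of GlusterFS.

```python
# Sketch only: assumes the "[section]" / "key=value" layout seen in the
# statedump snip quoted in this thread. Helper names are hypothetical.

SAMPLE = """\
[global.callpool.stack.2]
stack=0x7fd970002cdc
uid=0
gid=0
pid=31884
unique=37870
lk-owner=0000000000000000
op=READ
type=1
cnt=2

[global.callpool.stack.2.frame.1]
frame=0x7fd9700036ac
ref_count=0
translator=patchy-read-ahead
complete=0
parent=patchy-readdir-ahead
wind_from=ra_page_fault
wind_to=FIRST_CHILD (fault_frame->this)->fops->readv
unwind_to=ra_fault_cbk

[global.callpool.stack.2.frame.2]
frame=0x7fd97000346c
ref_count=1
translator=patchy-readdir-ahead
complete=0
"""


def parse_sections(text):
    """Split statedump text into a list of (section_name, fields) pairs."""
    sections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = (line[1:-1], {})
            sections.append(current)
        elif current is not None and "=" in line:
            # Split on the first "=" only; values such as wind_to may
            # contain other punctuation.
            key, _, value = line.partition("=")
            current[1][key] = value
    return sections


def incomplete_frames(text):
    """Return (stack, op, translator, wind_to) for every frame with complete=0."""
    ops = {}      # stack section name -> fop recorded on that stack
    result = []
    for name, fields in parse_sections(text):
        if ".frame." in name:
            if fields.get("complete") == "0":
                stack = name.split(".frame.")[0]
                result.append((stack, ops.get(stack),
                               fields.get("translator"),
                               fields.get("wind_to")))
        elif "callpool.stack." in name:
            ops[name] = fields.get("op")
    return result


if __name__ == "__main__":
    for stack, op, translator, wind_to in incomplete_frames(SAMPLE):
        print(stack, op, translator, wind_to)
```

Run against the snip, this lists only read-ahead and readdir-ahead frames as incomplete and no write-behind frame, which is the observation the argument above rests on. Running it over the full statedump attached to the bug would show whether any stack has a pending write-behind frame.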

>
> In write-behind's readv implementation, we stub READ fops and enqueue
> them in the relevant inode context. Once enqueued, the stub resumes
> when the appropriate set of conditions is met in write-behind. This is not
> happening now and I am not certain if:
>
> - READ fops are languishing in a queue and not being resumed or
> - READ fops are prematurely dropped from a queue without winding or
> unwinding
>
> When I gdb'd into the client process and examined the inode contexts
> for write-behind, I found all queues to be empty. This seems to
> indicate that the latter reason is more plausible but I have not yet
> found a code path to account for this possibility.
>
> One approach to proceed further is to add more logs in write-behind to
> get a better understanding of the problem. I will try that out
> sometime later this week. We are also considering disabling
> write-behind for smoke tests in the interim after a trial run (with
> write-behind disabled) later in the day.
>
> Thanks,
> Vijay
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://www.gluster.org/mailman/listinfo/gluster-devel
>
