Re: Upstream smoke test failures

Hi,

I have posted a fix for the hang in read: http://review.gluster.org/15901
I think it will fix the issue reported here. Please check the commit message of the patch
for more details.

Regards,
Poornima

From: "Nithya Balachandran" <nbalacha@xxxxxxxxxx>
To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
Sent: Tuesday, November 22, 2016 3:23:59 AM
Subject: Re: Upstream smoke test failures



On 22 November 2016 at 13:09, Raghavendra Gowdappa <rgowdapp@xxxxxxxxxx> wrote:


----- Original Message -----
> From: "Vijay Bellur" <vbellur@xxxxxxxxxx>
> To: "Nithya Balachandran" <nbalacha@xxxxxxxxxx>
> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
> Sent: Wednesday, November 16, 2016 9:41:12 AM
> Subject: Re: Upstream smoke test failures
>
> On Tue, Nov 15, 2016 at 8:40 AM, Nithya Balachandran
> <nbalacha@xxxxxxxxxx> wrote:
> >
> >
> > On 15 November 2016 at 18:55, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:
> >>
> >> On Mon, Nov 14, 2016 at 10:34 PM, Nithya Balachandran
> >> <nbalacha@xxxxxxxxxx> wrote:
> >> >
> >> >
> >> > On 14 November 2016 at 21:38, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:
> >> >>
> >> >> I would prefer that we disable dbench only if we have an owner for
> >> >> fixing the problem and re-enabling it as part of smoke tests. Running
> >> >> dbench seamlessly on gluster has worked for a long while and if it is
> >> >> failing today, we need to address this regression asap.
> >> >>
> >> >> Does anybody have more context or clues on why dbench is failing now?
> >> >>
> >> > While I agree that it needs to be looked at asap, leaving it in until we
> >> > get
> >> > an owner seems rather pointless as all it does is hold up various
> >> > patches
> >> > and waste machine time. Re-triggering it multiple times so that it
> >> > eventually passes does not add anything to the regression test processes
> >> > or
> >> > validate the patch as we know there is a problem.
> >> >
> >> > I would vote for removing it and assigning someone to look at it
> >> > immediately.
> >> >
> >>
> >> From the debugging done so far, can we identify an owner to whom this
> >> can be assigned? I looked around for related discussions and could
> >> figure out that we are looking to get statedumps. Do we have more
> >> information/context beyond this?
> >>
> > I have updated the BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1379228)
> > with info from the last failure - looks like hangs in write-behind and
> > read-ahead.
> >
>
>
> I spent some time on this today and it does look like write-behind is
> absorbing READs without performing any WIND/UNWIND actions. I have
> attached a statedump from a slave that had the dbench problem (thanks,
> Nigel!) to the above bug.
>
> Snip from statedump:
>
> [global.callpool.stack.2]
> stack=0x7fd970002cdc
> uid=0
> gid=0
> pid=31884
> unique=37870
> lk-owner=0000000000000000
> op=READ
> type=1
> cnt=2
>
> [global.callpool.stack.2.frame.1]
> frame=0x7fd9700036ac
> ref_count=0
> translator=patchy-read-ahead
> complete=0
> parent=patchy-readdir-ahead
> wind_from=ra_page_fault
> wind_to=FIRST_CHILD (fault_frame->this)->fops->readv
> unwind_to=ra_fault_cbk
>
> [global.callpool.stack.2.frame.2]
> frame=0x7fd97000346c
> ref_count=1
> translator=patchy-readdir-ahead
> complete=0
>
>
> Note that the frame which was wound from ra_page_fault() to
> write-behind is not yet complete and write-behind has not progressed
> the call. There are several call stacks with a similar signature in
> the statedump.
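(For anyone reproducing this analysis: a statedump of the fuse client can be triggered by sending the glusterfs client process SIGUSR1, and brick dumps can be requested through the CLI; by default the dump files land under /var/run/gluster.)

# fuse client process
kill -USR1 <pid-of-glusterfs-client>

# server-side bricks
gluster volume statedump <volname>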

I think the culprit here is read-ahead, not write-behind. If the read fop had been dropped in write-behind, we should've seen a frame associated with write-behind (complete=0 for a frame associated with an xlator indicates the frame was not unwound from _that_ xlator), but I didn't see any. The empty request queues in wb_inode also corroborate this hypothesis.

 
We have seen both. See comment #17 in https://bugzilla.redhat.com/show_bug.cgi?id=1379228.


regards,
Nithya


Karthick Subrahmanya is working on a similar issue reported by a user. However, we've not made much progress so far.

>
> In write-behind's readv implementation, we stub READ fops and enqueue
> them in the relevant inode context. Once enqueued, the stub resumes
> when the appropriate set of conditions occurs in write-behind. This is not
> happening now, and I am not certain whether:
>
> - READ fops are languishing in a queue and not being resumed, or
> - READ fops are prematurely dropped from a queue without winding or
> unwinding.
>
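To make the mechanism described above concrete, here is a minimal self-contained sketch of the stub/queue/resume pattern (stand-in types and function names only, not the actual write-behind source; the real code builds a stub with fop_readv_stub(), parks it on per-inode request lists, and later fires it with call_resume()):

#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-ins for the GlusterFS structures (illustration only). */
struct read_stub {
        int                unique;                           /* request id */
        void             (*resume)(struct read_stub *stub);
        struct read_stub  *next;
};

struct wb_inode {
        struct read_stub  *todo;                    /* queued READs, FIFO */
};

/* readv path: park the READ on the inode's queue instead of winding it. */
static void wb_enqueue(struct wb_inode *inode, struct read_stub *stub)
{
        struct read_stub **pp = &inode->todo;

        while (*pp)
                pp = &(*pp)->next;
        stub->next = NULL;
        *pp = stub;
}

/* Runs when pending writes are flushed: every queued READ must either be
 * resumed (wound/unwound) or stay queued -- dropping one silently is the
 * failure mode suspected in this thread. */
static void wb_process_queue(struct wb_inode *inode)
{
        while (inode->todo) {
                struct read_stub *stub = inode->todo;

                inode->todo = stub->next;
                stub->resume(stub);          /* analogous to call_resume() */
        }
}

static void resume_read(struct read_stub *stub)
{
        printf("READ unique=%d resumed\n", stub->unique);
        free(stub);
}

int main(void)
{
        struct wb_inode   inode = { NULL };
        struct read_stub *stub  = calloc(1, sizeof(*stub));

        if (!stub)
                return 1;

        stub->unique = 37870;          /* the unique seen in the statedump */
        stub->resume = resume_read;

        wb_enqueue(&inode, stub);      /* READ fop stubbed and queued */
        wb_process_queue(&inode);      /* conditions met: resume it   */

        return 0;
}

If the real counterpart of wb_process_queue() ever pops a request without resuming it, the READ is lost with exactly the signature seen in the statedump: no wind, no unwind, and empty queues afterwards.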
> When I gdb'd into the client process and examined the inode contexts
> for write-behind, I found all queues to be empty. This seems to
> indicate that the latter reason is more plausible but I have not yet
> found a code path to account for this possibility.
>
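For reference, the gdb inspection mentioned above looks roughly like the following mock session (the address and printed values are illustrative, and the wb_inode_t list names are from memory of the write-behind sources; an empty kernel-style list head prints with next == prev == its own address):

gdb -p <pid-of-glusterfs-client>
(gdb) print ((wb_inode_t *) 0x7fd970001230)->todo
$1 = {next = 0x7fd970001250, prev = 0x7fd970001250}
(gdb) print ((wb_inode_t *) 0x7fd970001230)->liability
$2 = {next = 0x7fd970001260, prev = 0x7fd970001260}

Both lists pointing back at themselves means no READ stub is queued, which is what rules out the "languishing in a queue" theory.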
> One approach to proceed further is to add more logs in write-behind to
> get a better understanding of the problem. I will try that out
> sometime later this week. We are also considering disabling
> write-behind for smoke tests in the interim after a trial run (with
> write-behind disabled) later in the day.
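(For the interim run, write-behind is a standard per-volume toggle; the extra tracing would be gf_log() calls at the enqueue/flush points in the xlator, whose exact placement is hypothetical here:)

# interim toggle: disable write-behind on the test volume
gluster volume set <volname> performance.write-behind off

/* sketch of the kind of tracing meant above */
gf_log (this->name, GF_LOG_TRACE, "READ unique=%"PRIu64" queued",
        frame->root->unique);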
>
> Thanks,
> Vijay


_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
