----- Original Message -----
> From: "Vijay Bellur" <vbellur@xxxxxxxxxx>
> To: "Nithya Balachandran" <nbalacha@xxxxxxxxxx>
> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
> Sent: Wednesday, November 16, 2016 9:41:12 AM
> Subject: Re: Upstream smoke test failures
>
> On Tue, Nov 15, 2016 at 8:40 AM, Nithya Balachandran
> <nbalacha@xxxxxxxxxx> wrote:
> >
> > On 15 November 2016 at 18:55, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:
> >>
> >> On Mon, Nov 14, 2016 at 10:34 PM, Nithya Balachandran
> >> <nbalacha@xxxxxxxxxx> wrote:
> >> >
> >> > On 14 November 2016 at 21:38, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:
> >> >>
> >> >> I would prefer that we disable dbench only if we have an owner for
> >> >> fixing the problem and re-enabling it as part of smoke tests. Running
> >> >> dbench seamlessly on gluster has worked for a long while, and if it is
> >> >> failing today, we need to address this regression asap.
> >> >>
> >> >> Does anybody have more context or clues on why dbench is failing now?
> >> >>
> >> > While I agree that it needs to be looked at asap, leaving it in until
> >> > we get an owner seems rather pointless, as all it does is hold up
> >> > various patches and waste machine time. Re-triggering it multiple
> >> > times until it eventually passes adds nothing to the regression test
> >> > process and does not validate the patch, since we know there is a
> >> > problem.
> >> >
> >> > I would vote for removing it and assigning someone to look at it
> >> > immediately.
> >> >
> >>
> >> From the debugging done so far, can we identify an owner to whom this
> >> can be assigned? I looked around for related discussions and could
> >> figure out that we are looking to get statedumps. Do we have more
> >> information/context beyond this?
> >>
> > I have updated the BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1379228)
> > with info from the last failure - it looks like hangs in write-behind
> > and read-ahead.
> >
>
> I spent some time on this today, and it does look like write-behind is
> absorbing READs without performing any WIND/UNWIND actions. I have
> attached a statedump from a slave that had the dbench problem (thanks,
> Nigel!) to the above bug.
>
> Snip from statedump:
>
> [global.callpool.stack.2]
> stack=0x7fd970002cdc
> uid=0
> gid=0
> pid=31884
> unique=37870
> lk-owner=0000000000000000
> op=READ
> type=1
> cnt=2
>
> [global.callpool.stack.2.frame.1]
> frame=0x7fd9700036ac
> ref_count=0
> translator=patchy-read-ahead
> complete=0
> parent=patchy-readdir-ahead
> wind_from=ra_page_fault
> wind_to=FIRST_CHILD (fault_frame->this)->fops->readv
> unwind_to=ra_fault_cbk
>
> [global.callpool.stack.2.frame.2]
> frame=0x7fd97000346c
> ref_count=1
> translator=patchy-readdir-ahead
> complete=0
>
> Note that the frame which was wound from ra_page_fault() to
> write-behind is not yet complete, and write-behind has not progressed
> the call. There are several call stacks with a similar signature in the
> statedump.

I think the culprit here is read-ahead, not write-behind. If the read fop had
been dropped in write-behind, we should have seen a frame associated with
write-behind (complete=0 for a frame associated with an xlator indicates that
the frame was not unwound from _that_ xlator), but I didn't see any. The empty
request queues in wb_inode also corroborate this hypothesis. Karthick
Subrahmanya is working on a similar issue reported by a user, but we have not
made much progress so far.
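As an aside, the complete=0 argument above can be made concrete with a tiny
model. The sketch below is plain illustrative C, not GlusterFS code: the
frame_t/callpool structures and the wind()/unwind()/statedump() helpers are
invented stand-ins for the real call-frame machinery, and the only behaviour
modelled is that winding a fop into a translator creates a frame associated
with that translator, while unwinding it marks the frame complete.

/* Hypothetical model of call-frame accounting -- NOT GlusterFS code.
 * It mimics only the two statedump fields used in the argument above:
 * which translator a frame is associated with, and whether the fop has
 * been unwound from it (complete). */
#include <stdio.h>
#include <stdlib.h>

typedef struct frame {
    const char   *translator;  /* xlator this frame is associated with */
    int           complete;    /* 1 once the fop was unwound from it   */
    struct frame *next;        /* global call pool, as in a statedump  */
} frame_t;

static frame_t *callpool;

/* Winding a fop into a translator creates a frame with complete=0. */
static frame_t *wind(const char *translator)
{
    frame_t *f = calloc(1, sizeof(*f));
    f->translator = translator;
    f->next = callpool;
    callpool = f;
    return f;
}

/* Unwinding the fop from that translator marks its frame complete. */
static void unwind(frame_t *f)
{
    f->complete = 1;
}

static void statedump(void)
{
    for (frame_t *f = callpool; f != NULL; f = f->next)
        printf("translator=%s complete=%d\n", f->translator, f->complete);
}

int main(void)
{
    /* A READ that write-behind received and answered: its frame ends up
     * with complete=1. */
    unwind(wind("patchy-write-behind"));

    /* The hung READ: it was wound into read-ahead, which neither wound it
     * further down (so no write-behind frame is ever created for it) nor
     * unwound it (so its own frame stays at complete=0).  That combination
     * is the signature pointing at read-ahead rather than write-behind. */
    wind("patchy-read-ahead");

    statedump();
    return 0;
}

Running the model prints a complete=1 frame for the answered READ and a lone
complete=0 read-ahead frame for the hung one, with no write-behind frame at
all, which mirrors the pattern in the statedump above.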
> In write-behind's readv implementation, we stub READ fops and enqueue
> them in the relevant inode context. Once enqueued, a stub resumes when
> the appropriate set of conditions is met in write-behind. This is not
> happening now, and I am not certain whether:
>
> - READ fops are languishing in a queue and not being resumed, or
> - READ fops are prematurely dropped from a queue without being wound
>   or unwound.
>
> When I gdb'd into the client process and examined the inode contexts
> for write-behind, I found all queues to be empty. This seems to
> indicate that the latter reason is more plausible, but I have not yet
> found a code path to account for this possibility.
>
> One approach to making further progress is to add more logging in
> write-behind to get a better understanding of the problem. I will try
> that out sometime later this week. We are also considering disabling
> write-behind for smoke tests in the interim, after a trial run (with
> write-behind disabled) later in the day.
>
> Thanks,
> Vijay

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
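For completeness, the stub-and-enqueue behaviour Vijay describes can also be
sketched in a few lines of C. This is a hypothetical model and not the real
write-behind translator: wb_inode_ctx_t, enqueue_read(), resume_reads() and
drop_reads() are invented names. The point is only to contrast the two
suspected failure modes, a stub left languishing in the per-inode queue
versus a stub dropped from the queue without ever being wound or unwound;
the second leaves the queues empty, which matches what gdb showed.

/* Hypothetical sketch of "stub READs and park them per inode" -- NOT the
 * real write-behind code.  Every name below is invented for illustration. */
#include <stdio.h>
#include <stdlib.h>

typedef struct read_stub {
    int               id;     /* stands in for a parked READ fop */
    struct read_stub *next;
} read_stub_t;

typedef struct wb_inode_ctx {
    read_stub_t *queue;       /* READs waiting for pending writes */
} wb_inode_ctx_t;

/* readv path: park the READ in the inode context instead of winding it. */
static void enqueue_read(wb_inode_ctx_t *ctx, int id)
{
    read_stub_t *stub = calloc(1, sizeof(*stub));
    stub->id = id;
    stub->next = ctx->queue;
    ctx->queue = stub;
}

/* Healthy path: once the ordering conditions are met, every parked READ
 * is resumed, i.e. wound further down the stack. */
static void resume_reads(wb_inode_ctx_t *ctx)
{
    while (ctx->queue != NULL) {
        read_stub_t *stub = ctx->queue;
        ctx->queue = stub->next;
        printf("READ %d resumed (wound to the next xlator)\n", stub->id);
        free(stub);
    }
}

/* Hypothesis 2: a buggy cleanup path empties the queue without resuming
 * anything.  Afterwards the queue looks empty, exactly as in the gdb
 * session, yet the READ was never wound or unwound, so the caller hangs. */
static void drop_reads(wb_inode_ctx_t *ctx)
{
    while (ctx->queue != NULL) {
        read_stub_t *stub = ctx->queue;
        ctx->queue = stub->next;
        free(stub);           /* silently lost: no wind, no unwind */
    }
}

int main(void)
{
    wb_inode_ctx_t ctx = { 0 };

    enqueue_read(&ctx, 1);
    resume_reads(&ctx);       /* READ 1 completes normally */

    enqueue_read(&ctx, 2);
    drop_reads(&ctx);         /* READ 2 vanishes; queue ends up empty */

    printf("queue empty: %s, READ 2 never resumed\n",
           ctx.queue == NULL ? "yes" : "no");
    return 0;
}

If the first hypothesis were true instead, the gdb session would have shown
non-empty queues with the stubbed READs still sitting in them; the empty
queues are what make a silent drop (or the READ never reaching write-behind
at all, as argued earlier in the thread) the more plausible explanation.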