On Tue, Nov 15, 2016 at 8:40 AM, Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
>
> On 15 November 2016 at 18:55, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:
>>
>> On Mon, Nov 14, 2016 at 10:34 PM, Nithya Balachandran
>> <nbalacha@xxxxxxxxxx> wrote:
>> >
>> > On 14 November 2016 at 21:38, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:
>> >>
>> >> I would prefer that we disable dbench only if we have an owner for
>> >> fixing the problem and re-enabling it as part of smoke tests. Running
>> >> dbench seamlessly on gluster has worked for a long while and if it is
>> >> failing today, we need to address this regression asap.
>> >>
>> >> Does anybody have more context or clues on why dbench is failing now?
>> >
>> > While I agree that it needs to be looked at asap, leaving it in until
>> > we get an owner seems rather pointless as all it does is hold up
>> > various patches and waste machine time. Re-triggering it multiple
>> > times so that it eventually passes does not add anything to the
>> > regression test process or validate the patch, as we know there is a
>> > problem.
>> >
>> > I would vote for removing it and assigning someone to look at it
>> > immediately.
>>
>> From the debugging done so far can we identify an owner to whom this
>> can be assigned? I looked around for related discussions and could
>> figure out that we are looking to get statedumps. Do we have more
>> information/context beyond this?
>>
> I have updated the BZ (https://bugzilla.redhat.com/show_bug.cgi?id=1379228)
> with info from the last failure - looks like hangs in write-behind and
> read-ahead.

I spent some time on this today and it does look like write-behind is
absorbing READs without performing any WIND/UNWIND actions. I have
attached a statedump from a slave that had the dbench problem (thanks,
Nigel!) to the above bug.

Snip from the statedump:

[global.callpool.stack.2]
stack=0x7fd970002cdc
uid=0
gid=0
pid=31884
unique=37870
lk-owner=0000000000000000
op=READ
type=1
cnt=2

[global.callpool.stack.2.frame.1]
frame=0x7fd9700036ac
ref_count=0
translator=patchy-read-ahead
complete=0
parent=patchy-readdir-ahead
wind_from=ra_page_fault
wind_to=FIRST_CHILD (fault_frame->this)->fops->readv
unwind_to=ra_fault_cbk

[global.callpool.stack.2.frame.2]
frame=0x7fd97000346c
ref_count=1
translator=patchy-readdir-ahead
complete=0

Note that the frame wound from ra_page_fault() to write-behind is not
yet complete and write-behind has not progressed the call. There are
several call stacks with a similar signature in the statedump.

In write-behind's readv implementation, we stub READ fops and enqueue
the stubs in the relevant inode context. Once enqueued, a stub is
resumed when the appropriate set of conditions is met in write-behind.
That is not happening now, and I am not certain whether:

- READ fops are languishing in a queue and not being resumed, or
- READ fops are prematurely dropped from a queue without being wound
  or unwound.

When I attached gdb to the client process and examined the write-behind
inode contexts, I found all queues to be empty. This suggests that the
latter explanation is more plausible, but I have not yet found a code
path that would account for it.

One way to proceed is to add more logging in write-behind to get a
better understanding of the problem. I will try that out sometime later
this week. We are also considering disabling write-behind for the smoke
tests in the interim, after a trial run (with write-behind disabled)
later in the day.
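On the mechanics of the trial run: write-behind can be toggled per
volume through the volume set interface, so something along the lines
of "gluster volume set <volname> performance.write-behind off" for the
test volumes should be all that is needed, with the same option set
back to "on" once the investigation is done.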
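As an aside, to make the stub/enqueue behaviour described above a
little more concrete, here is a rough, self-contained sketch of the
pattern in plain C. This is not the actual write-behind code; every
type and function name below is invented purely for illustration:

/*
 * Rough, self-contained sketch of a "stub and enqueue" read path.
 * NOT the actual write-behind code; all names are invented for
 * illustration only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* A "stub" captures a paused READ so it can be resumed later. */
struct read_stub {
        int               fd;      /* file the READ is against        */
        size_t            size;    /* how much to read                */
        off_t             offset;  /* where to read from              */
        struct read_stub *next;    /* linkage in the per-inode queue  */
};

/* Per-inode context: the queue the stubs are parked on. */
struct inode_ctx {
        struct read_stub *head;
        struct read_stub *tail;
};

/* readv path: park the request instead of winding it immediately. */
static void
enqueue_read (struct inode_ctx *ctx, int fd, size_t size, off_t offset)
{
        struct read_stub *stub = calloc (1, sizeof (*stub));

        if (!stub)
                abort ();

        stub->fd = fd;
        stub->size = size;
        stub->offset = offset;

        if (ctx->tail)
                ctx->tail->next = stub;
        else
                ctx->head = stub;
        ctx->tail = stub;

        printf ("enqueued READ fd=%d size=%zu offset=%lld\n",
                fd, size, (long long) offset);
}

/*
 * Called when the queued reads may proceed (for example, when no
 * overlapping cached writes remain).  Every stub that was enqueued
 * must eventually pass through here (wind) or be explicitly unwound
 * with an error; a stub that leaves the queue any other way leaves
 * its frame hanging forever.
 */
static void
resume_reads (struct inode_ctx *ctx)
{
        while (ctx->head) {
                struct read_stub *stub = ctx->head;

                ctx->head = stub->next;
                if (!ctx->head)
                        ctx->tail = NULL;

                printf ("resuming READ fd=%d size=%zu offset=%lld\n",
                        stub->fd, stub->size, (long long) stub->offset);
                free (stub);
        }
}

int
main (void)
{
        struct inode_ctx ctx = { NULL, NULL };

        enqueue_read (&ctx, 3, 4096, 0);
        enqueue_read (&ctx, 3, 4096, 4096);

        /* If this call never happens, or the queue is emptied some
         * other way, the enqueued READs are stuck -- the situation
         * the statedump above appears to show. */
        resume_reads (&ctx);

        return 0;
}

The invariant the sketch tries to highlight is that every stub entering
the queue must eventually be wound (resumed) or explicitly unwound with
an error; the empty queues seen in gdb, combined with the incomplete
frames in the statedump, look like stubs leaving the queue without
either happening.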
Thanks,
Vijay

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel