Re: commit 01e99aeca397 causes longer runtime of block/004

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Mar 06, 2020 / 16:13, Ming Lei wrote:
> On Fri, Mar 06, 2020 at 06:06:23AM +0000, Shinichiro Kawasaki wrote:
> > On Mar 05, 2020 / 10:48, Ming Lei wrote:
> > > Hi Shinichiro,
> > > 
> > > On Thu, Mar 05, 2020 at 01:19:02AM +0000, Shinichiro Kawasaki wrote:
> > > > On Mar 04, 2020 / 17:53, Ming Lei wrote:
> > > > > On Wed, Mar 04, 2020 at 06:11:37AM +0000, Shinichiro Kawasaki wrote:
> > > > > > On Mar 04, 2020 / 11:46, Ming Lei wrote:
> > > > > > > On Wed, Mar 04, 2020 at 02:38:43AM +0000, Shinichiro Kawasaki wrote:
> > > > > > > > I noticed that blktests block/004 takes longer runtime with 5.6-rc4 than
> > > > > > > > 5.6-rc3, and found that the commit 01e99aeca397 ("blk-mq: insert passthrough
> > > > > > > > request into hctx->dispatch directly") triggers it.
> > > > > > > > 
> > > > > > > > The longer runtime was observed with dm-linear device which maps SATA SMR HDD
> > > > > > > > connected via AHCI. It was not observed with dm-linear on SAS/SATA SMR HDDs
> > > > > > > > connected via SAS-HBA. Not observed with dm-linear on non-SMR HDDs either.
> > > > > > > > 
> > > > > > > > Before the commit, block/004 took around 130 seconds. After the commit, it takes
> > > > > > > > around 300 seconds. I need to dig in further details to understand why the
> > > > > > > > commit makes the test case longer.
> > > > > > > > 
> > > > > > > > The test case block/004 does "flush intensive workload". Is this longer runtime
> > > > > > > > expected?
> > > > > > > 
> > > > > > > The following patch might address this issue:
> > > > > > > 
> > > > > > > https://lore.kernel.org/linux-block/20200207190416.99928-1-sqazi@xxxxxxxxxx/#t
> > > > > > > 
> > > > > > > Please test and provide us the result.
> > > > > > > 
> > > > > > > thanks,
> > > > > > > Ming
> > > > > > >
> > > > > > 
> > > > > > Hi Ming,
> > > > > > 
> > > > > > I applied the patch to 5.6-rc4 but I observed the longer runtime of block/004.
> > > > > > Still it takes around 300 seconds.
> > > > > 
> > > > > Hello Shinichiro,
> > > > > 
> > > > > block/004 only sends 1564 sync randwrite, and seems 130s has been slow
> > > > > enough.
> > > > > 
> > > > > There are two related effect in that commit for your issue:
> > > > > 
> > > > > 1) 'at_head' is applied in blk_mq_sched_insert_request() for flush
> > > > > request
> > > > > 
> > > > > 2) all IO is added back to tail of hctx->dispatch after .queue_rq()
> > > > > returns STS_RESOURCE
> > > > > 
> > > > > Seems it is more related with 2) given you can't reproduce the issue on 
> > > > > SAS.
> > > > > 
> > > > > So please test the following two patches, and see which one makes a
> > > > > difference for you.
> > > > > 
> > > > > BTW, both two looks not reasonable, just for narrowing down the issue.
> > > > > 
> > > > > 1) patch 1
> > > > > 
> > > > > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > > > > index 856356b1619e..86137c75283c 100644
> > > > > --- a/block/blk-mq-sched.c
> > > > > +++ b/block/blk-mq-sched.c
> > > > > @@ -398,7 +398,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,
> > > > >  	WARN_ON(e && (rq->tag != -1));
> > > > >  
> > > > >  	if (blk_mq_sched_bypass_insert(hctx, !!e, rq)) {
> > > > > -		blk_mq_request_bypass_insert(rq, at_head, false);
> > > > > +		blk_mq_request_bypass_insert(rq, true, false);
> > > > >  		goto run;
> > > > >  	}
> > > > 
> > > > Ming, thank you for the trial patches.
> > > > This "patch 1" reduced the runtime, as short as rc3.
> > > > 
> > > > > 
> > > > > 
> > > > > 2) patch 2
> > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > index d92088dec6c3..447d5cb39832 100644
> > > > > --- a/block/blk-mq.c
> > > > > +++ b/block/blk-mq.c
> > > > > @@ -1286,7 +1286,7 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
> > > > >  			q->mq_ops->commit_rqs(hctx);
> > > > >  
> > > > >  		spin_lock(&hctx->lock);
> > > > > -		list_splice_tail_init(list, &hctx->dispatch);
> > > > > +		list_splice_init(list, &hctx->dispatch);
> > > > >  		spin_unlock(&hctx->lock);
> > > > >  
> > > > >  		/*
> > > > 
> > > > This patch 2 didn't reduce the runtime.
> > > > 
> > > > Wish this report helps.
> > > 
> > > Your feedback does help, then please test the following patch:
> > 
> > Hi Ming, thank you for the patch. I applied it on top of rc4 and confirmed
> > it reduces the runtime as short as rc3. Good.
> 
> Hi Shinichiro,
> 
> Thanks for your test!
> 
> Then I think the following change should make the difference actually,
> you may double check that and confirm if it is that.
> 
> > @@ -334,7 +334,7 @@ static void blk_kick_flush(struct request_queue *q, struct blk_flush_queue *fq,
> >  	flush_rq->rq_disk = first_rq->rq_disk;
> >  	flush_rq->end_io = flush_end_io;
> >  
> > -	blk_flush_queue_rq(flush_rq, false);
> > +	blk_flush_queue_rq(flush_rq, true);

Yes, with this the one line change above only, the runtime was reduced.

> 
> However, the flush request is added to tail of dispatch queue[1] for long time.
> 0cacba6cf825 ("blk-mq-sched: bypass the scheduler for flushes entirely")
> and its predecessor(all mq scheduler start patches) changed to add flush request
> to front of dispatch queue for blk-mq by ignoring 'add_queue' parameter of
> blk_mq_sched_insert_flush(). That change may be by accident, and not sure it is
> correct.
> 
> I guess once flush rq is added to tail of dispatch queue in block/004,
> in which lots of FS request may stay in hctx->dispatch because of low
> AHCI queue depth, then we may take a bit long for flush rq to be
> submitted to LLD.
> 
> I'd suggest to root cause/understand the issue given it isn't obvious
> correct to queue flush rq at front of dispatch queue, so could you collect
> the following trace via the following script with/without the single line
> patch?

Thank you for the thoughts for the correct design. I have taken the two traces,
with and without the one liner patch above. The gzip archived trace files have
1.6MB size. It looks too large to post to the list. Please let me know how you
want the trace files shared.

Thanks!

-- 
Best Regards,
Shin'ichiro Kawasaki

> 
> #!/bin/sh
> 
> MAJ=$1
> MIN=$2
> MAJ=$(( $MAJ << 20 ))
> DEV=$(( $MAJ | $MIN ))
> 
> /usr/share/bcc/tools/trace -t -C \
>     't:block:block_getrq (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector' \
>     't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector' \
>     't:block:block_rq_complete (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector'
> 
> note:
> 1) install bcc package
> 2) pass the test disk(AHCI/SMR) major & minor to above script
> 3) start the script before running block/004
> 
> 
> 
> [1] 53d63e6b0dfb block: make the flush insertion use the tail of the dispatch list
> 
> Thanks,
> Ming
> 



[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux