Re: for-next branch and blktests/srp

On 12/6/18 2:04 PM, Jens Axboe wrote:
> On 12/6/18 1:56 PM, Bart Van Assche wrote:
>> On Thu, 2018-12-06 at 08:47 -0800, Bart Van Assche wrote:
>>> If I merge Jens' for-next branch with Linus' master branch, boot the
>>> resulting kernel in a VM and run blktests/tests/srp/002 then that test
>>> never finishes. The same test passes against Linus' master branch. I
>>> think this is a regression. The following appears in the system log if
>>> I run that test:
>>>
>>> Call Trace:
>>> INFO: task kworker/0:1:12 blocked for more than 120 seconds.
>>> Call Trace:
>>> INFO: task ext4lazyinit:2079 blocked for more than 120 seconds.
>>> Call Trace:
>>> INFO: task fio:2151 blocked for more than 120 seconds.
>>> Call Trace:
>>> INFO: task fio:2154 blocked for more than 120 seconds.
>>
>> Hi Jens,
>>
>> My test results so far are as follows:
>> * With kernel v4.20-rc5, test srp/002 passes.
>> * With your for-next branch, test srp/002 shows the symptoms reported in my e-mail.
>> * With Linus' master branch from this morning, test srp/002 fails in the same way as
>>   with your for-next branch.
>> * Also with Linus' master branch, test srp/002 passes if I revert the following commit:
>>   ffe81d45322c ("blk-mq: fix corruption with direct issue"). So it seems that commit
>>   fixed one regression but introduced another.
> 
> Yes, I'm on the same page; I've been able to reproduce it. It seems to be
> related to dm and bypass insert, which is somewhat odd. If I just do:
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index deb56932f8c4..4c44e6fa0d08 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -2637,7 +2637,8 @@ blk_status_t blk_insert_cloned_request(struct request_queue *q, struct request *
>  		 * bypass a potential scheduler on the bottom device for
>  		 * insert.
>  		 */
> -		return blk_mq_request_issue_directly(rq);
> +		blk_mq_request_bypass_insert(rq, true);
> +		return BLK_STS_OK;
>  	}
>  
>  	spin_lock_irqsave(q->queue_lock, flags);
> 
> it works fine. Well, at least this regression is less serious. I'll bang
> out a fix for it and ensure we make -rc6. I'm guessing it's the bypassing
> of non-read/write requests, which the top of your dispatch list also shows
> to be a non-read/write. But there should be no new failure case here that
> wasn't possible before, only it's easier to hit now.

OK, so here's the thing. As part of the corruption fix, we disallowed
direct dispatch for anything that wasn't a read or write. This means
that your WRITE_ZEROES will always fail direct dispatch. When it does,
we return busy. But the next time around, dm will try the exact same thing
again: blk_insert_cloned_request() -> direct dispatch -> fail. Before,
we'd eventually succeed; now we will always fail for that request type.
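[Editorial note: a minimal sketch of the loop described above, illustrative
only. blk_insert_cloned_request() and the BLK_STS_* codes are real; the
surrounding requeue loop is an approximation of the dm-mpath behavior, not
the actual dm-rq code.]

	blk_status_t ret;

	do {
		/* dm hands the cloned WRITE_ZEROES down the stack */
		ret = blk_insert_cloned_request(clone->q, clone);
		/*
		 * Direct dispatch now refuses non-read/write ops, so we
		 * get a "busy" status back, dm requeues the original
		 * request, and we end up retrying the exact same thing.
		 */
	} while (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE);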

The insert clone path is unique in that regard.

So we have two options: the patch I did above, which always just does
bypass insert for DM, or marking the request as having failed so we
don't retry direct dispatch for it (sketched below). I'm still not
crazy about the direct dispatch insert off this path.

And we'd need to do this on the original request in dm, not the clone
we are passed, or it won't be persistent. Hence I lean towards the
already posted patch.
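
[Editorial note: a rough sketch of what the "mark as failed" option could
look like in blk_insert_cloned_request(). RQF_NO_DIRECT_ISSUE is a
hypothetical flag name; per the above it would have to be set on the
original dm request (and carried over to the clone) when direct issue
fails. Only blk_mq_request_bypass_insert() and
blk_mq_request_issue_directly() are existing interfaces here.]

	if (q->mq_ops) {
		/*
		 * Hypothetical RQF_NO_DIRECT_ISSUE flag: set once direct
		 * issue has already failed for this request, so we fall
		 * back to a plain bypass insert instead of retrying a
		 * dispatch that can never succeed.
		 */
		if (rq->rq_flags & RQF_NO_DIRECT_ISSUE) {
			blk_mq_request_bypass_insert(rq, true);
			return BLK_STS_OK;
		}

		/*
		 * Since we have a scheduler attached on the top device,
		 * bypass a potential scheduler on the bottom device for
		 * insert.
		 */
		return blk_mq_request_issue_directly(rq);
	}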

-- 
Jens Axboe



