Re: [PATCH] blk-mq: fix corruption with direct issue

Jens Axboe <axboe@xxxxxxxxx> · Tue, 4 Dec 2018 20:04:17 -0700

On 12/4/18 7:58 PM, Ming Lei wrote:
> On Tue, Dec 04, 2018 at 07:30:24PM -0700, Jens Axboe wrote:
>> On 12/4/18 7:27 PM, Ming Lei wrote:
>>> On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
>>>> On 12/4/18 6:37 PM, Ming Lei wrote:
>>>>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>>>>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>>>>>> we queue the request up normally. However, the SCSI layer may have
>>>>>> already setup SG tables etc for this particular command. If we later
>>>>>> merge with this request, then the old tables are no longer valid. Once
>>>>>> we issue the IO, we only read/write the original part of the request,
>>>>>> not the new state of it.
>>>>>>
>>>>>> This causes data corruption, and is most often noticed with the file
>>>>>> system complaining about the just read data being invalid:
>>>>>>
>>>>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>>>>>
>>>>>> because most of it is garbage...
>>>>>>
>>>>>> This doesn't happen from the normal issue path, as we will simply defer
>>>>>> the request to the hardware queue dispatch list if we fail. Once it's on
>>>>>> the dispatch list, we never merge with it.
>>>>>>
>>>>>> Fix this from the direct issue path by flagging the request as
>>>>>> REQ_NOMERGE so we don't change the size of it before issue.
>>>>>>
>>>>>> See also:
>>>>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>>>>>
>>>>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
>>>>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>>>>>
>>>>>> ---
>>>>>>
>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>> index 3f91c6e5b17a..d8f518c6ea38 100644
>>>>>> --- a/block/blk-mq.c
>>>>>> +++ b/block/blk-mq.c
>>>>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>>>>>>  		break;
>>>>>>  	case BLK_STS_RESOURCE:
>>>>>>  	case BLK_STS_DEV_RESOURCE:
>>>>>> +		/*
>>>>>> +		 * If direct dispatch fails, we cannot allow any merging on
>>>>>> +		 * this IO. Drivers (like SCSI) may have set up permanent state
>>>>>> +		 * for this request, like SG tables and mappings, and if we
>>>>>> +		 * merge to it later on then we'll still only do IO to the
>>>>>> +		 * original part.
>>>>>> +		 */
>>>>>> +		rq->cmd_flags |= REQ_NOMERGE;
>>>>>> +
>>>>>>  		blk_mq_update_dispatch_busy(hctx, true);
>>>>>>  		__blk_mq_requeue_request(rq);
>>>>>>  		break;
>>>>>>
>>>>>
>>>>> Not sure it is enough to just mark it as NOMERGE, for example, driver
>>>>> may have setup the .special_vec for discard, and NOMERGE may not prevent
>>>>> request from entering elevator queue completely. Cause 'rq.rb_node' and
>>>>> 'rq.special_vec' share same space.
>>>>
>>>> We should rather limit the scope of the direct dispatch instead. It
>>>> doesn't make sense to do for anything but read/write anyway.
>>>
>>> discard is kind of write, and it isn't treated very specially in make
>>> request path, except for multi-range discard.
>>
>> The point of direct dispatch is to reduce latencies for requests,
>> discards are so damn slow on ALL devices anyway that it doesn't make any
>> sense to try direct dispatch to begin with, regardless of whether it
>> possible or not.
> 
> SCSI MQ device may benefit from direct dispatch from reduced lock contention.
> 
>>
>>>>> So how about inserting this request via blk_mq_request_bypass_insert()
>>>>> in case that direct issue returns BUSY? Then it is invariant that
>>>>> any request queued via .queue_rq() won't enter scheduler queue.
>>>>
>>>> I did consider this, but I didn't want to experiment with exercising
>>>> a new path for an important bug fix. You do realize that your original
>>>> patch has been corrupting data for months? I think a little caution
>>>> is in order here.
>>>
>>> But marking NOMERGE still may have a hole on re-insert discard request as
>>> mentioned above.
>>
>> What I said was further limit the scope of direct dispatch, which means
>> not allowing anything that isn't a read/write.
> 
> IMO, the conservative approach is to take the one used in legacy io
> path, in which it is never allowed to re-insert queued request to
> scheduler queue except for requeue, however RQF_DONTPREP is cleared
> before requeuing request to scheduler.

I don't agree. Whether you agree or not, I'm doing the simple fix for
4.20 and queueing it up for 4.21 as well. Then I hope Jianchao can
rework his direct dispatch cleanup patches for 4.21 on top of that,
and then we can experiment on top of that with other solutions.

>>> Given we never allow to re-insert queued request to scheduler queue
>>> except for 6ce3dd6eec1, I think it is the correct thing to do, and the
>>> fix is simple too.
>>
>> As I said, it's not the time to experiment. This issue has been there
>> since 4.19-rc1. The alternative is yanking both those patches, and then
>> looking at it later when the direct issue path has been cleaned up
>> first.
> 
> The issue should have been there from v4.1, especially after commit
> f984df1f0f7 ("blk-mq: do limited block plug for multiple queue case"),
> which is the 1st one to re-insert the queued request into scheduler
> queue.

You can hand wave all you want, fact is that people have been hitting
this left and right since 4.19 was released, and it was even bisected
down to the specific two direct issue patches you did. This isn't
some theoretical debate. You've seen the bugzilla, I suggest you go
read the whole thing.

-- 
Jens Axboe