Re: [PATCH] blk-mq: fix corruption with direct issue

Jens Axboe <axboe@xxxxxxxxx> · Tue, 4 Dec 2018 19:30:24 -0700

On 12/4/18 7:27 PM, Ming Lei wrote:
> On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
>> On 12/4/18 6:37 PM, Ming Lei wrote:
>>> On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
>>>> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
>>>> we queue the request up normally. However, the SCSI layer may have
>>>> already setup SG tables etc for this particular command. If we later
>>>> merge with this request, then the old tables are no longer valid. Once
>>>> we issue the IO, we only read/write the original part of the request,
>>>> not the new state of it.
>>>>
>>>> This causes data corruption, and is most often noticed with the file
>>>> system complaining about the just read data being invalid:
>>>>
>>>> [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
>>>>
>>>> because most of it is garbage...
>>>>
>>>> This doesn't happen from the normal issue path, as we will simply defer
>>>> the request to the hardware queue dispatch list if we fail. Once it's on
>>>> the dispatch list, we never merge with it.
>>>>
>>>> Fix this from the direct issue path by flagging the request as
>>>> REQ_NOMERGE so we don't change the size of it before issue.
>>>>
>>>> See also:
>>>>   https://bugzilla.kernel.org/show_bug.cgi?id=201685
>>>>
>>>> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
>>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>>>
>>>> ---
>>>>
>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>> index 3f91c6e5b17a..d8f518c6ea38 100644
>>>> --- a/block/blk-mq.c
>>>> +++ b/block/blk-mq.c
>>>> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
>>>>  		break;
>>>>  	case BLK_STS_RESOURCE:
>>>>  	case BLK_STS_DEV_RESOURCE:
>>>> +		/*
>>>> +		 * If direct dispatch fails, we cannot allow any merging on
>>>> +		 * this IO. Drivers (like SCSI) may have set up permanent state
>>>> +		 * for this request, like SG tables and mappings, and if we
>>>> +		 * merge to it later on then we'll still only do IO to the
>>>> +		 * original part.
>>>> +		 */
>>>> +		rq->cmd_flags |= REQ_NOMERGE;
>>>> +
>>>>  		blk_mq_update_dispatch_busy(hctx, true);
>>>>  		__blk_mq_requeue_request(rq);
>>>>  		break;
>>>>
>>>
>>> Not sure it is enough to just mark it as NOMERGE, for example, driver
>>> may have setup the .special_vec for discard, and NOMERGE may not prevent
>>> request from entering elevator queue completely. Cause 'rq.rb_node' and
>>> 'rq.special_vec' share same space.
>>
>> We should rather limit the scope of the direct dispatch instead. It
>> doesn't make sense to do for anything but read/write anyway.
> 
> discard is kind of write, and it isn't treated very specially in make
> request path, except for multi-range discard.

The point of direct dispatch is to reduce latencies for requests,
discards are so damn slow on ALL devices anyway that it doesn't make any
sense to try direct dispatch to begin with, regardless of whether it
possible or not.

>>> So how about inserting this request via blk_mq_request_bypass_insert()
>>> in case that direct issue returns BUSY? Then it is invariant that
>>> any request queued via .queue_rq() won't enter scheduler queue.
>>
>> I did consider this, but I didn't want to experiment with exercising
>> a new path for an important bug fix. You do realize that your original
>> patch has been corrupting data for months? I think a little caution
>> is in order here.
> 
> But marking NOMERGE still may have a hole on re-insert discard request as
> mentioned above.

What I said was further limit the scope of direct dispatch, which means
not allowing anything that isn't a read/write.

> Given we never allow to re-insert queued request to scheduler queue
> except for 6ce3dd6eec1, I think it is the correct thing to do, and the
> fix is simple too.

As I said, it's not the time to experiment. This issue has been there
since 4.19-rc1. The alternative is yanking both those patches, and then
looking at it later when the direct issue path has been cleaned up
first.

-- 
Jens Axboe