On 12/4/18 6:37 PM, Ming Lei wrote: > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote: >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then >> we queue the request up normally. However, the SCSI layer may have >> already setup SG tables etc for this particular command. If we later >> merge with this request, then the old tables are no longer valid. Once >> we issue the IO, we only read/write the original part of the request, >> not the new state of it. >> >> This causes data corruption, and is most often noticed with the file >> system complaining about the just read data being invalid: >> >> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256) >> >> because most of it is garbage... >> >> This doesn't happen from the normal issue path, as we will simply defer >> the request to the hardware queue dispatch list if we fail. Once it's on >> the dispatch list, we never merge with it. >> >> Fix this from the direct issue path by flagging the request as >> REQ_NOMERGE so we don't change the size of it before issue. >> >> See also: >> https://bugzilla.kernel.org/show_bug.cgi?id=201685 >> >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'") >> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx> >> >> --- >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index 3f91c6e5b17a..d8f518c6ea38 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx, >> break; >> case BLK_STS_RESOURCE: >> case BLK_STS_DEV_RESOURCE: >> + /* >> + * If direct dispatch fails, we cannot allow any merging on >> + * this IO. Drivers (like SCSI) may have set up permanent state >> + * for this request, like SG tables and mappings, and if we >> + * merge to it later on then we'll still only do IO to the >> + * original part. >> + */ >> + rq->cmd_flags |= REQ_NOMERGE; >> + >> blk_mq_update_dispatch_busy(hctx, true); >> __blk_mq_requeue_request(rq); >> break; >> > > Not sure it is enough to just mark it as NOMERGE, for example, driver > may have setup the .special_vec for discard, and NOMERGE may not prevent > request from entering elevator queue completely. Cause 'rq.rb_node' and > 'rq.special_vec' share same space. We should rather limit the scope of the direct dispatch instead. It doesn't make sense to do for anything but read/write anyway. > So how about inserting this request via blk_mq_request_bypass_insert() > in case that direct issue returns BUSY? Then it is invariant that > any request queued via .queue_rq() won't enter scheduler queue. I did consider this, but I didn't want to experiment with exercising a new path for an important bug fix. You do realize that your original patch has been corrupting data for months? I think a little caution is in order here. -- Jens Axboe