On Tue, Dec 04, 2018 at 07:16:11PM -0700, Jens Axboe wrote:
> On 12/4/18 6:37 PM, Ming Lei wrote:
> > On Tue, Dec 04, 2018 at 03:47:46PM -0700, Jens Axboe wrote:
> >> If we attempt a direct issue to a SCSI device, and it returns BUSY, then
> >> we queue the request up normally. However, the SCSI layer may have
> >> already setup SG tables etc for this particular command. If we later
> >> merge with this request, then the old tables are no longer valid. Once
> >> we issue the IO, we only read/write the original part of the request,
> >> not the new state of it.
> >>
> >> This causes data corruption, and is most often noticed with the file
> >> system complaining about the just read data being invalid:
> >>
> >> [ 235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
> >>
> >> because most of it is garbage...
> >>
> >> This doesn't happen from the normal issue path, as we will simply defer
> >> the request to the hardware queue dispatch list if we fail. Once it's on
> >> the dispatch list, we never merge with it.
> >>
> >> Fix this from the direct issue path by flagging the request as
> >> REQ_NOMERGE so we don't change the size of it before issue.
> >>
> >> See also:
> >> https://bugzilla.kernel.org/show_bug.cgi?id=201685
> >>
> >> Fixes: 6ce3dd6eec1 ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
> >> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> >>
> >> ---
> >>
> >> diff --git a/block/blk-mq.c b/block/blk-mq.c
> >> index 3f91c6e5b17a..d8f518c6ea38 100644
> >> --- a/block/blk-mq.c
> >> +++ b/block/blk-mq.c
> >> @@ -1715,6 +1715,15 @@ static blk_status_t __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> >>  		break;
> >>  	case BLK_STS_RESOURCE:
> >>  	case BLK_STS_DEV_RESOURCE:
> >> +		/*
> >> +		 * If direct dispatch fails, we cannot allow any merging on
> >> +		 * this IO. Drivers (like SCSI) may have set up permanent state
> >> +		 * for this request, like SG tables and mappings, and if we
> >> +		 * merge to it later on then we'll still only do IO to the
> >> +		 * original part.
> >> +		 */
> >> +		rq->cmd_flags |= REQ_NOMERGE;
> >> +
> >>  		blk_mq_update_dispatch_busy(hctx, true);
> >>  		__blk_mq_requeue_request(rq);
> >>  		break;
> >>
> >
> > Not sure it is enough to just mark it as NOMERGE, for example, driver
> > may have setup the .special_vec for discard, and NOMERGE may not prevent
> > request from entering elevator queue completely. Cause 'rq.rb_node' and
> > 'rq.special_vec' share same space.
>
> We should rather limit the scope of the direct dispatch instead. It
> doesn't make sense to do for anything but read/write anyway.

Discard is a kind of write, and it isn't treated specially in the
make-request path, except for multi-range discard.

>
> > So how about inserting this request via blk_mq_request_bypass_insert()
> > in case that direct issue returns BUSY? Then it is invariant that
> > any request queued via .queue_rq() won't enter scheduler queue.
>
> I did consider this, but I didn't want to experiment with exercising
> a new path for an important bug fix. You do realize that your original
> patch has been corrupting data for months? I think a little caution
> is in order here.

But marking NOMERGE may still leave a hole when a discard request is
re-inserted, as mentioned above.

Given that, apart from 6ce3dd6eec1, we never allow a queued request to be
re-inserted into the scheduler queue, I think the bypass insert is the
correct thing to do, and the fix is simple too.

Thanks,
Ming
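
To make the 'rq.rb_node' vs. 'rq.special_vec' concern concrete, here is a rough userspace sketch. It is not kernel code: every toy_* name and TOY_NOMERGE is made up for illustration, and the struct layouts only mimic the idea that the two fields share storage in a union. It assumes that inserting into the scheduler's sort tree writes rb_node, while a bypass insert onto the dispatch list does not.

```c
#include <assert.h>   /* only needed by the usage sketch described below */
#include <stdbool.h>
#include <string.h>

/* Toy stand-ins, NOT the kernel definitions. */
struct toy_rb_node {
	unsigned long parent_color;
	struct toy_rb_node *right;
	struct toy_rb_node *left;
};

struct toy_vec {
	void *page;
	unsigned int len;
	unsigned int offset;
};

#define TOY_NOMERGE (1u << 0)	/* hypothetical analogue of REQ_NOMERGE */

struct toy_request {
	unsigned int flags;
	bool on_dispatch_list;
	union {
		struct toy_rb_node rb_node;	/* scheduler sort-tree linkage */
		struct toy_vec special_vec;	/* driver payload, e.g. for discard */
	};
};

/* What a driver might do in .queue_rq() before returning BUSY. */
static void toy_driver_prep(struct toy_request *rq, void *payload,
			    unsigned int len)
{
	rq->special_vec.page = payload;
	rq->special_vec.len = len;
	rq->special_vec.offset = 0;
}

/* Scheduler insert: initialising rb_node overwrites the shared storage. */
static void toy_sched_insert(struct toy_request *rq)
{
	memset(&rq->rb_node, 0, sizeof(rq->rb_node));
	rq->on_dispatch_list = false;
}

/* Bypass insert: straight to the dispatch list, rb_node is never touched. */
static void toy_bypass_insert(struct toy_request *rq)
{
	rq->on_dispatch_list = true;
}

/* True if the driver's payload description survived the insert. */
static bool toy_payload_intact(const struct toy_request *rq,
			       const void *payload, unsigned int len)
{
	return rq->special_vec.page == payload && rq->special_vec.len == len;
}
```

In this toy model, a request that is prepared, flagged with TOY_NOMERGE and then passed to toy_sched_insert() loses its payload pointer regardless of the flag, whereas the same request passed to toy_bypass_insert() keeps it. That is the invariant argued for above: once a request has been through .queue_rq(), it should not enter the scheduler queue again.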