On Tue, 26/11/2019 at 10.32 +0800, Ming Lei wrote:
> On Mon, Nov 25, 2019 at 07:51:33PM +0100, Andrea Vai wrote:
> > On Mon, 25/11/2019 at 23.15 +0800, Ming Lei wrote:
> > > On Mon, Nov 25, 2019 at 03:58:34PM +0100, Andrea Vai wrote:
> > >
> > > [...]
> > >
> > > > What to try next?
> > >
> > > 1) cat /sys/kernel/debug/block/$DISK/hctx0/flags
> >
> > result:
> >
> > alloc_policy=FIFO SHOULD_MERGE|2
> >
> > > 2) echo 128 > /sys/block/$DISK/queue/nr_requests and run your
> > > copy 1GB test again.
> >
> > done, and still fails. What to try next?
>
> I just ran a 256M cp test

I would like to point out that 256MB is a file size that usually
doesn't trigger the issue (I don't know if it matters, sorry).

Another piece of information I can provide concerns another strange
behavior I noticed: yesterday I ran the test twice (as usual, with a
1GB file); the two runs took 2370s and 1786s, and a third was still
going on when I stopped it. Then I started another set of 100 trials
and let them run overnight: the first 10 trials took around 1000s
each, then the times gradually decreased to ~300s, and finally
settled around 200s, with some trials below 70-80s.

This is to say that the times are extremely variable, and that for
the first time I noticed a sort of "performance increase" over time.

> to one USB storage device on a patched kernel, and the WRITE data
> IO is really dispatched in ascending order. The filesystem is ext4,
> mounted without '-o sync'. From the previous discussion, that looks
> exactly like your test setting. The order can be observed via the
> following script:
>
> #!/bin/sh
> MAJ=$1
> MIN=$2
> MAJ=$(( $MAJ << 20 ))
> DEV=$(( $MAJ | $MIN ))
> /usr/share/bcc/tools/trace -t -C \
>     't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector'
>
> $MAJ and $MIN can be retrieved via lsblk for your USB storage disk.
>
> So I think we need to check if the patch is applied correctly first.
>
> If your kernel tree is managed via git,

yes it is,

> please post 'git diff'.

attached. Is it correctly patched? Thanks.

> Otherwise, share your kernel version,

btw, it is 5.4.0+

> and I will send you a patch backported to that kernel version.
>
> Meanwhile, you can collect the IO order log via the above script as
> you did last time, then send us the log.

ok, I will try; is it just required to run it for a short period of
time (say, a few seconds) during the copy, or should I start it
before the beginning of the copy (or before the mount?) and terminate
it after the copy ends? Please note that in the latter case a large
amount of time (and data, I suppose) would be involved because, as
said, to be sure the problem triggers I have to use a large file...
but we can try to better understand and tune this. If it helps, you
can get an ods file with the complete statistics at [1] (look at the
"prove_nov19" sheet).

Thanks,
Andrea

[1]: http://fisica.unipv.it/transfer/kernelstats.zip
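For reference, here is a worked invocation of Ming Lei's script
above. The script name trace-io-order.sh and the 8:16 major:minor
pair are made-up examples; substitute the values lsblk reports for
your own USB disk:

# lsblk -d -o NAME,MAJ:MIN /dev/sdb
NAME MAJ:MIN
sdb    8:16

# ./trace-io-order.sh 8 16 > io-order.log

With MAJ=8 and MIN=16 the script filters on (8 << 20) | 16 = 8388624,
matching the kernel's internal dev_t encoding used by the block
tracepoints (major in the high bits, minor in the low 20 bits).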
# git diff
diff --git a/block/blk-mq.c b/block/blk-mq.c
index ec791156e9cc..92d60a5e1d15 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1465,7 +1465,13 @@ static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
 	if (unlikely(blk_mq_hctx_stopped(hctx)))
 		return;
 
-	if (!async && !(hctx->flags & BLK_MQ_F_BLOCKING)) {
+	/*
+	 * Some single-queue devices may need to dispatch IO in order,
+	 * which was guaranteed for the legacy queue via the big queue
+	 * lock. Now we rely on the single hctx->run_work for that.
+	 */
+	if (!async && !(hctx->flags & (BLK_MQ_F_BLOCKING |
+					BLK_MQ_F_STRICT_DISPATCH_ORDER))) {
 		int cpu = get_cpu();
 		if (cpumask_test_cpu(cpu, hctx->cpumask)) {
 			__blk_mq_run_hw_queue(hctx);
@@ -3055,6 +3061,10 @@ int blk_mq_alloc_tag_set(struct blk_mq_tag_set *set)
 	if (!set->ops->get_budget ^ !set->ops->put_budget)
 		return -EINVAL;
 
+	if (set->queue_depth > 1 && (set->flags &
+				BLK_MQ_F_STRICT_DISPATCH_ORDER))
+		return -EINVAL;
+
 	if (set->queue_depth > BLK_MQ_MAX_DEPTH) {
 		pr_info("blk-mq: reduced tag depth to %u\n",
 			BLK_MQ_MAX_DEPTH);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 91c007d26c1e..f013630275c9 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1902,6 +1902,9 @@ int scsi_mq_setup_tags(struct Scsi_Host *shost)
 	shost->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
 	shost->tag_set.flags |=
 		BLK_ALLOC_POLICY_TO_MQ_FLAG(shost->hostt->tag_alloc_policy);
+	if (shost->hostt->strict_dispatch_order)
+		shost->tag_set.flags |= BLK_MQ_F_STRICT_DISPATCH_ORDER;
+
 	shost->tag_set.driver_data = shost;
 
 	return blk_mq_alloc_tag_set(&shost->tag_set);
diff --git a/drivers/usb/storage/scsiglue.c b/drivers/usb/storage/scsiglue.c
index 54a3c8195c96..df1674d7f0fc 100644
--- a/drivers/usb/storage/scsiglue.c
+++ b/drivers/usb/storage/scsiglue.c
@@ -651,6 +651,18 @@ static const struct scsi_host_template usb_stor_host_template = {
 	/* we do our own delay after a device or bus reset */
 	.skip_settle_delay = 1,
 
+	/*
+	 * Some USB storage, such as the Kingston Technology DataTraveler
+	 * 100 G3/G4/SE9 G2 (ID 0951:1666), requires IO to be dispatched
+	 * in sequential order, otherwise IO performance may drop drastically.
+	 *
+	 * can_queue is always 1, so we set .strict_dispatch_order for
+	 * the USB mass storage HBA. Another reason is that there can be
+	 * such kinds of devices too.
+	 */
+	.strict_dispatch_order = 1,
+
 	/* sysfs device attributes */
 	.sdev_attrs = sysfs_device_attr_list,
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0bf056de5cc3..89b1c28da36a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -226,6 +226,7 @@ struct blk_mq_ops {
 enum {
 	BLK_MQ_F_SHOULD_MERGE	= 1 << 0,
 	BLK_MQ_F_TAG_SHARED	= 1 << 1,
+	BLK_MQ_F_STRICT_DISPATCH_ORDER	= 1 << 2,
 	BLK_MQ_F_BLOCKING	= 1 << 5,
 	BLK_MQ_F_NO_SCHED	= 1 << 6,
 	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 31e0d6ca1eba..dbcbc9ef6993 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -442,6 +442,13 @@ struct scsi_host_template {
 	/* True if the low-level driver supports blk-mq only */
 	unsigned force_blk_mq:1;
 
+	/*
+	 * True if the low-level driver needs IO to be dispatched in
+	 * the order provided by the legacy IO path. The flag is only
+	 * valid for single-queue devices.
+	 */
+	unsigned strict_dispatch_order:1;
+
 	/*
 	 * Countdown for host blocking with no commands outstanding.
 	 */
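To make the intended usage of the new flag concrete, here is a
minimal, hypothetical sketch (not part of the patch; the my_* names
are placeholders and the queue_rq stub does nothing useful) of a
single-queue blk-mq driver opting in to BLK_MQ_F_STRICT_DISPATCH_ORDER
directly:

#include <linux/blk-mq.h>
#include <linux/string.h>

/* Placeholder ->queue_rq: a real driver would submit bd->rq to
 * hardware here and complete it later via blk_mq_complete_request(). */
static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
				const struct blk_mq_queue_data *bd)
{
	return BLK_STS_OK;
}

static const struct blk_mq_ops my_mq_ops = {
	.queue_rq	= my_queue_rq,
};

static struct blk_mq_tag_set my_tag_set;

static int my_driver_init_tags(void)
{
	memset(&my_tag_set, 0, sizeof(my_tag_set));
	my_tag_set.ops		= &my_mq_ops;
	my_tag_set.nr_hw_queues	= 1;
	/*
	 * Must stay 1: with the patch applied, blk_mq_alloc_tag_set()
	 * returns -EINVAL for queue_depth > 1 combined with
	 * BLK_MQ_F_STRICT_DISPATCH_ORDER; the flag is defined only
	 * for single-queue, depth-1 devices.
	 */
	my_tag_set.queue_depth	= 1;
	my_tag_set.flags	= BLK_MQ_F_SHOULD_MERGE |
				  BLK_MQ_F_STRICT_DISPATCH_ORDER;

	return blk_mq_alloc_tag_set(&my_tag_set);
}

For USB storage itself none of this boilerplate is needed: the
scsiglue.c hunk sets .strict_dispatch_order in the host template, and
scsi_mq_setup_tags() translates that into the blk-mq flag.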