On 3/23/21 8:31 AM, Sagi Grimberg wrote:
>> Actually, I had been playing around with marking the entire bio as
>> 'NOWAIT'; that would avoid the tag stall, too:
>>
>> @@ -313,7 +316,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
>>  	ns = nvme_find_path(head);
>>  	if (likely(ns)) {
>>  		bio_set_dev(bio, ns->disk->part0);
>> -		bio->bi_opf |= REQ_NVME_MPATH;
>> +		bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
>>  		trace_block_bio_remap(bio, disk_devt(ns->head->disk),
>>  				      bio->bi_iter.bi_sector);
>>  		ret = submit_bio_noacct(bio);
>>
>> My only worry here is that we might incur spurious failures under high
>> load; but then this is not necessarily a bad thing.
>
> What? Making spurious failures is not OK under any load. What fs will
> take into account that you may have run out of tags?
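
Just to recap what REQ_NOWAIT buys us when the tags are exhausted:
blk-mq will not sleep in tag allocation, but completes the bio straight
away with BLK_STS_AGAIN. Roughly (a simplified paraphrase of the
blk_mq_submit_bio() / bio_wouldblock_error() logic, not the literal
upstream code):

	rq = __blk_mq_alloc_request(&data);
	if (!rq) {
		/* no tag available */
		if (bio->bi_opf & REQ_NOWAIT) {
			/* bio_wouldblock_error(): fail fast instead of sleeping */
			bio->bi_status = BLK_STS_AGAIN;
			bio_endio(bio);
			return BLK_QC_T_NONE;
		}
		/* ... otherwise the submitter would have slept for a tag */
	}
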
Well, it's not actually a spurious failure, but rather a spurious
failover: we're still in a multipath scenario, and those bios will simply
be re-routed to other paths, or queued if all paths are out of tags.
Hence the filesystem would not see any difference in behaviour.
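
To make the "spurious failover" idea concrete: this is roughly how I
would picture catching a BLK_STS_AGAIN completion on the head side and
feeding the bio back into the existing requeue machinery. Just a sketch
on top of the nvme_ns_head requeue_list/requeue_work bits, not
something we have actually posted or tested:

	/*
	 * Sketch only: turn a NOWAIT failure on one path into a requeue
	 * on the ns_head, so that nvme_requeue_work() resubmits the bio,
	 * possibly via a different path.
	 */
	static void nvme_mpath_retry_nowait_bio(struct nvme_ns_head *head,
						struct bio *bio)
	{
		unsigned long flags;

		if (bio->bi_status != BLK_STS_AGAIN)
			return;	/* genuine errors take the normal failover path */

		bio->bi_status = BLK_STS_OK;

		spin_lock_irqsave(&head->requeue_lock, flags);
		bio_list_add(&head->requeue_list, bio);
		spin_unlock_irqrestore(&head->requeue_lock, flags);

		/* requeue_work ends up in nvme_ns_head_submit_bio(), which
		 * calls nvme_find_path() again */
		kblockd_schedule_work(&head->requeue_work);
	}
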
But in the end, we abandoned this attempt, as the crash we've been
seeing was in bio_endio (due to bi_bdev still pointing to the removed
path device):
[ 6552.155251] bio_endio+0x74/0x120
[ 6552.155260] nvme_ns_head_submit_bio+0x36f/0x3e0 [nvme_core]
[ 6552.155271] submit_bio_noacct+0x175/0x490
[ 6552.155284] ? nvme_requeue_work+0x5a/0x70 [nvme_core]
[ 6552.155290] nvme_requeue_work+0x5a/0x70 [nvme_core]
[ 6552.155296] process_one_work+0x1f4/0x3e0
[ 6552.155299] worker_thread+0x2d/0x3e0
[ 6552.155302] ? process_one_work+0x3e0/0x3e0
[ 6552.155305] kthread+0x10d/0x130
[ 6552.155307] ? kthread_park+0xa0/0xa0
[ 6552.155311] ret_from_fork+0x35/0x40
So we're not blocked on blk_queue_enter(), and it's a crash, not a
deadlock. Blocking on blk_queue_enter() certainly plays a part here,
but it seems not to be the full picture.
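
FWIW, the obvious band-aid for the crash above would be to make sure
the bio no longer references the (possibly already removed) path device
before we complete it, e.g. by pointing it back at the ns_head disk.
Again just a sketch, not a tested patch:

	/*
	 * Sketch: complete a multipath bio against the ns_head gendisk,
	 * which outlives the individual paths, instead of the path
	 * device that bi_bdev may still reference.
	 */
	static void nvme_ns_head_fail_bio(struct nvme_ns_head *head,
					  struct bio *bio)
	{
		bio_set_dev(bio, head->disk->part0);
		bio->bi_status = BLK_STS_IOERR;
		bio_endio(bio);
	}
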
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer