On Tue, Mar 23, 2021 at 09:36:47AM +0100, Hannes Reinecke wrote:
> On 3/23/21 8:31 AM, Sagi Grimberg wrote:
> > 
> > > Actually, I had been playing around with marking the entire bio as
> > > 'NOWAIT'; that would avoid the tag stall, too:
> > > 
> > > @@ -313,7 +316,7 @@ blk_qc_t nvme_ns_head_submit_bio(struct bio *bio)
> > >          ns = nvme_find_path(head);
> > >          if (likely(ns)) {
> > >                  bio_set_dev(bio, ns->disk->part0);
> > > -                bio->bi_opf |= REQ_NVME_MPATH;
> > > +                bio->bi_opf |= REQ_NVME_MPATH | REQ_NOWAIT;
> > >                  trace_block_bio_remap(bio, disk_devt(ns->head->disk),
> > >                                        bio->bi_iter.bi_sector);
> > >                  ret = submit_bio_noacct(bio);
> > > 
> > > My only worry here is that we might incur spurious failures under
> > > high load; but then this is not necessarily a bad thing.
> > 
> > What? Making spurious failures is not ok under any load. What fs will
> > take into account that you may have run out of tags?
> 
> Well, it's not actually a spurious failure but rather a spurious failover,
> as we're still in a multipath scenario, and bios will still be re-routed to
> other paths, or queued if all paths are out of tags.
> Hence the OS would not see any difference in behaviour.

Failover might be overkill. We can run out of tags in a perfectly normal
situation, and simply waiting may be the best option; even scheduling on a
different CPU may be sufficient to get a viable tag, rather than selecting
a different path.

Does it make sense to just abort all allocated tags during a reset and let
the original bio requeue for multipath IO?

> But in the end we abandoned this attempt, as the crash we've been seeing
> was in bio_endio (due to bi_bdev still pointing to the removed path
> device):
> 
> [ 6552.155251]  bio_endio+0x74/0x120
> [ 6552.155260]  nvme_ns_head_submit_bio+0x36f/0x3e0 [nvme_core]
> [ 6552.155271]  submit_bio_noacct+0x175/0x490
> [ 6552.155284]  ? nvme_requeue_work+0x5a/0x70 [nvme_core]
> [ 6552.155290]  nvme_requeue_work+0x5a/0x70 [nvme_core]
> [ 6552.155296]  process_one_work+0x1f4/0x3e0
> [ 6552.155299]  worker_thread+0x2d/0x3e0
> [ 6552.155302]  ? process_one_work+0x3e0/0x3e0
> [ 6552.155305]  kthread+0x10d/0x130
> [ 6552.155307]  ? kthread_park+0xa0/0xa0
> [ 6552.155311]  ret_from_fork+0x35/0x40
> 
> So we're not blocked on blk_queue_enter(), and it's a crash, not a
> deadlock. Blocking on blk_queue_enter() certainly plays a part here,
> but it seems not to be the full picture.
> 
> Cheers,
> 
> Hannes
> -- 
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@xxxxxxx                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
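
For reference on the REQ_NOWAIT mechanics discussed above: when no tag can
be allocated, blk_mq fails the bio immediately with BLK_STS_AGAIN instead
of sleeping on the tag wait queue. A rough, trimmed paraphrase of the
mainline blk_mq_submit_bio() path around v5.12 (not an exact quote):

        rq = __blk_mq_alloc_request(&data);
        if (unlikely(!rq)) {
                if (bio->bi_opf & REQ_NOWAIT)
                        /* ends the bio with bi_status = BLK_STS_AGAIN */
                        bio_wouldblock_error(bio);
                goto queue_exit;
        }

The bio then completes through its normal ->bi_end_io with BLK_STS_AGAIN;
nothing requeues it automatically, so whoever submitted it has to be
prepared to retry.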
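
On aborting all allocated tags during a reset: the driver already has the
pieces for that in its teardown paths. A minimal sketch, assuming it runs
after the queues have been quiesced (ctrl->tagset, nvme_cancel_request()
and the blk_mq helpers are real mainline symbols; only the placement is
the assumption here):

        /*
         * Terminate every outstanding request on this controller.
         * nvme_cancel_request() completes each request with a path
         * error, so nvme_complete_rq() can fail over REQ_NVME_MPATH
         * requests and requeue their bios on the ns_head.
         */
        blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl);
        blk_mq_tagset_wait_completed_request(ctrl->tagset);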
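
As for the bio_endio crash: since nvme_ns_head_submit_bio() re-targets the
bio at the chosen path with bio_set_dev(), a bio coming back through the
requeue list can still carry the bi_bdev of a path that has since been
deleted. One possible shape of a fix is to re-point the bio at the ns_head
disk before completing it; a sketch against the code quoted above, for the
case where no path is available (hypothetical placement, not a tested
patch):

        if (!ns) {
                /*
                 * bi_bdev may still reference the previous, now removed,
                 * path device from an earlier submission; point the bio
                 * back at the ns_head disk so bio_endio() never touches
                 * the deleted bdev.
                 */
                bio_set_dev(bio, head->disk->part0);
                bio->bi_status = BLK_STS_IOERR;
                bio_endio(bio);
        }

That would keep the completion path on an object whose lifetime is tied to
the head rather than to an individual path.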