On 03.05.24 09:59, Sagi Grimberg wrote:
>
>
> On 4/30/24 17:17, Yi Zhang wrote:
>> On Tue, Apr 30, 2024 at 2:17 PM Johannes Thumshirn
>> <Johannes.Thumshirn@xxxxxxx> wrote:
>>> On 30.04.24 00:18, Chaitanya Kulkarni wrote:
>>>> On 4/29/24 07:35, Johannes Thumshirn wrote:
>>>>> On 23.04.24 15:18, Yi Zhang wrote:
>>>>>> Hi
>>>>>> I found this issue on the latest linux-block/for-next by blktests
>>>>>> nvme/tcp nvme/012, please help check it and let me know if you need
>>>>>> any info/testing for it, thanks.
>>>>>>
>>>>>> [ 1873.394323] run blktests nvme/012 at 2024-04-23 04:13:47
>>>>>> [ 1873.761900] loop0: detected capacity change from 0 to 2097152
>>>>>> [ 1873.846926] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>>>>> [ 1873.987806] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>>>>> [ 1874.208883] nvmet: creating nvm controller 1 for subsystem
>>>>>> blktests-subsystem-1 for NQN
>>>>>> nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
>>>>>> [ 1874.243423] nvme nvme0: creating 48 I/O queues.
>>>>>> [ 1874.362383] nvme nvme0: mapped 48/0/0 default/read/poll queues.
>>>>>> [ 1874.517677] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr
>>>>>> 127.0.0.1:4420, hostnqn:
>>>>>> nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
>>>> [...]
>>>>
>>>>>> [ 326.827260] run blktests nvme/012 at 2024-04-29 16:28:31
>>>>>> [ 327.475957] loop0: detected capacity change from 0 to 2097152
>>>>>> [ 327.538987] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>>>>> [ 327.603405] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
>>>>>> [ 327.872343] nvmet: creating nvm controller 1 for subsystem
>>>>>> blktests-subsystem-1 for NQN
>>>>>> nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
>>>>>> [ 327.877120] nvme nvme0: Please enable CONFIG_NVME_MULTIPATH for full
>>>>>> support of multi-port devices.
>>>>
>>>> Seems like you don't have multipath enabled; that is one difference
>>>> I can see between the log posted by Yi and your log.
>>>
>>> Yup, but even with multipath enabled I can't get the bug to trigger :(
>>
>> It's not a 100% reproducible issue; I tried on another server of mine
>> and it cannot be reproduced there.
>
> Looking at the trace, I think I can see the issue here. In the test
> case, nvme-mpath fails the request upon submission as the queue is not
> live, and because it is an mpath request, it is failed over using
> nvme_failover_request, which steals the bios from the request to its
> private requeue list.
>
> The bisected patch introduces a req->bio dereference for a flush request
> that has no bios (stolen by the failover sequence). The reproduction
> seems to be related to where in the flush sequence the request
> completion is called.
>
> I am unsure if simply making the dereference conditional is the correct
> fix or not... Damien?
> --
> diff --git a/block/blk-flush.c b/block/blk-flush.c
> index 2f58ae018464..c17cf8ed8113 100644
> --- a/block/blk-flush.c
> +++ b/block/blk-flush.c
> @@ -130,7 +130,8 @@ static void blk_flush_restore_request(struct request *rq)
>  	 * original @rq->bio. Restore it.
>  	 */
>  	rq->bio = rq->biotail;
> -	rq->__sector = rq->bio->bi_iter.bi_sector;
> +	if (rq->bio)
> +		rq->__sector = rq->bio->bi_iter.bi_sector;
>
>  	/* make @rq a normal request */
>  	rq->rq_flags &= ~RQF_FLUSH_SEQ;
> --

This is something Damien added to his patch series. I just wonder why I
couldn't reproduce the failure, even with nvme-mpath enabled. I tried
both nvme-tcp and nvme-loop without any problems.
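
For anyone following along: the bios go missing because the failover path
Sagi describes ends in blk_steal_bios(), which splices the request's bio
chain onto the multipath requeue list and leaves the request with no bios
at all. A rough sketch from memory of blk_steal_bios() in block/blk-core.c
(paraphrased, the exact body may differ between kernel versions):

	void blk_steal_bios(struct bio_list *list, struct request *rq)
	{
		if (rq->bio) {
			/* splice the request's whole bio chain onto @list */
			if (list->tail)
				list->tail->bi_next = rq->bio;
			else
				list->head = rq->bio;
			list->tail = rq->biotail;

			/* the request keeps no reference to its bios */
			rq->bio = NULL;
			rq->biotail = NULL;
		}

		rq->__data_len = 0;
	}

If that is what happens here, then by the time blk_flush_restore_request()
runs, rq->biotail is already NULL, the restore copies that NULL back into
rq->bio, and the unconditional rq->bio->bi_iter.bi_sector dereference from
the bisected patch blows up. That would also fit the sporadic reproduction,
since it needs the queue to not be live at submission time so that the
failover path runs at all.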