On 07/02/2017 04:45 AM, Max Gurtovoy wrote:
>
>
> On 6/30/2017 8:26 PM, Jens Axboe wrote:
>> Hi Max,
>
> Hi Jens,
>
>>
>> I remembered you reporting this. I think this is a regression introduced
>> with the scheduling, since ->rqs[] isn't static anymore. ->static_rqs[]
>> is, but that's not indexable by the tag we find. So I think we need to
>> guard those with a NULL check. The actual requests themselves are
>> static, so we know the memory itself isn't going away. But if we race
>> with completion, we could find a NULL there, validly.
>>
>> Since you could reproduce it, can you try the below?
>
> I can still repro the NULL deref with this patch applied.
>
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index d0be72ccb091..b856b2827157 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>  		bitnr += tags->nr_reserved_tags;
>>  	rq = tags->rqs[bitnr];
>>  
>> -	if (rq->q == hctx->queue)
>> +	if (rq && rq->q == hctx->queue)
>>  		iter_data->fn(hctx, rq, iter_data->data, reserved);
>>  	return true;
>>  }
>> @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>  	if (!reserved)
>>  		bitnr += tags->nr_reserved_tags;
>>  	rq = tags->rqs[bitnr];
>> -
>> -	iter_data->fn(rq, iter_data->data, reserved);
>> +	if (rq)
>> +		iter_data->fn(rq, iter_data->data, reserved);
>>  	return true;
>>  }
>
> See the attached file for the dmesg output.
>
> Output of gdb:
>
> (gdb) list *(blk_mq_flush_busy_ctxs+0x48)
> 0xffffffff8127b108 is in blk_mq_flush_busy_ctxs
> (./include/linux/sbitmap.h:234).
> 229
> 230         for (i = 0; i < sb->map_nr; i++) {
> 231                 struct sbitmap_word *word = &sb->map[i];
> 232                 unsigned int off, nr;
> 233
> 234                 if (!word->word)
> 235                         continue;
> 236
> 237                 nr = 0;
> 238                 off = i << sb->shift;
>
>
> When I change the "if (!word->word)" to "if (word && !word->word)",
> I still get a NULL deref, now at "nr = find_next_bit(&word->word,
> word->depth, nr);". Somehow word seems to become NULL.
>
> Adding the linux-nvme guys too.
> Sagi has mentioned that this can only be NULL if we remove the tagset
> while I/O is trying to get a tag. When killing the target we go into
> error recovery and periodic reconnects, which does _NOT_ include freeing
> the tagset, so this is probably the admin tagset.
>
> Sagi,
> you've mentioned a patch for centralizing the handling of the admin
> tagset in the nvme core. I think I missed that patch, so can you please
> send a pointer to it and I'll check if it helps?

Right, this is clearly a different issue, and my first thought as well was
that it's a missing quiesce of the queue: we're iterating the tags while
they are being torn down.

Looks like Sagi's patch fixes the issue, so I'm considering this one
resolved.

-- 
Jens Axboe
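
As a rough sketch of the missing-quiesce idea discussed above (this is not
Sagi's actual patch; the example_* names are invented for illustration, and
only blk_mq_quiesce_queue() and blk_mq_tagset_busy_iter() are the real
blk-mq interfaces of this era), a driver's teardown/error-recovery path
would look roughly like this:

	#include <linux/blk-mq.h>

	/*
	 * Invented callback: driver-specific cancellation of a busy
	 * request, e.g. ending it with an error so upper layers retry.
	 */
	static void example_cancel_rq(struct request *rq, void *data,
				      bool reserved)
	{
	}

	static void example_teardown(struct request_queue *q,
				     struct blk_mq_tag_set *set)
	{
		/*
		 * Stop new ->queue_rq() invocations and let in-flight
		 * dispatches drain, so the iterator below cannot race
		 * with tags being started or torn down.
		 */
		blk_mq_quiesce_queue(q);

		/* Now it is safe to walk the busy tags and cancel them. */
		blk_mq_tagset_busy_iter(set, example_cancel_rq, q);
	}

Presumably something along these lines is what Sagi's patch arranges for
the admin tagset in the nvme core, so that error recovery never iterates
tags on a queue that hasn't been quiesced first.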