On 10/24/19 3:28 AM, Ming Lei wrote:
> On Thu, Oct 24, 2019 at 4:53 PM James Smart <jsmart2021@xxxxxxxxx> wrote:
>>
>> During the following test scenario:
>> - Offline a cpu
>> - Load the lpfc driver, which auto-discovers NVMe devices. For a new
>>   nvme device, the lpfc/nvme_fc transport can request up to
>>   num_online_cpus() worth of nr_hw_queues. The target in this case
>>   allowed at least that many nvme queues.
>>
>> The system encountered the following crash:
>>
>>   BUG: unable to handle kernel paging request at 00003659d33953a8
>>   ...
>>   Workqueue: nvme-wq nvme_fc_connect_ctrl_work [nvme_fc]
>>   RIP: 0010:blk_mq_get_request+0x21d/0x3c0
>>   ...
>>   blk_mq_alloc_request_hctx+0xef/0x140
>>   nvme_alloc_request+0x32/0x80 [nvme_core]
>>   __nvme_submit_sync_cmd+0x4a/0x1c0 [nvme_core]
>>   nvmf_connect_io_queue+0x130/0x1a0 [nvme_fabrics]
>>   nvme_fc_connect_io_queues+0x285/0x2b0 [nvme_fc]
>>   nvme_fc_create_association+0x8ea/0x9c0 [nvme_fc]
>>   nvme_fc_connect_ctrl_work+0x19/0x50 [nvme_fc]
>>   ...
>>
>> A while ago, a commit to simplify queue mapping replaced the use of
>> cpumask_first() with cpumask_first_and(). The issue is that if
>> cpumask_first_and() does not find any _intersecting_ cpus, it returns
>> nr_cpu_ids, which is not a valid per_cpu_ptr index as used in
>> __blk_mq_get_ctx().
>>
>> We considered reverting to cpumask_first(), but instead followed the
>> logic in blk_mq_first_mapped_cpu(): check for nr_cpu_ids before
>> falling back to cpumask_first().
>>
>> Fixes: 20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
>> Signed-off-by: Shagun Agrawal <shagun.agrawal@xxxxxxxxxxxx>
>> Signed-off-by: James Smart <jsmart2021@xxxxxxxxx>
>> CC: Christoph Hellwig <hch@xxxxxx>
>> ---
>>  block/blk-mq.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 8538dc415499..0b06b4ea57f1 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -461,6 +461,8 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>>  		return ERR_PTR(-EXDEV);
>>  	}
>>  	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>> +	if (cpu >= nr_cpu_ids)
>> +		cpu = cpumask_first(alloc_data.hctx->cpumask);
>
> The first cpu may be offline too; in that case a kernel warning or a
> timeout may be triggered later, when the allocated request is
> dispatched.
>
> To be honest, blk_mq_alloc_request_hctx() is really a weird interface,
> given that the hctx may become dead just as blk_mq_alloc_request_hctx()
> is being called.
>
> Given that only NVMe FC/RDMA use this interface, could you provide some
> background on this kind of usage?
>
> The normal usage is that the user doesn't specify the hctx to allocate
> the request from, since blk-mq can figure out which hctx to use via the
> queue mapping. Just wondering why NVMe FC/RDMA can't work that way?

Fully agree, it'd be much nicer if that weird interface could just die.

-- 
Jens Axboe
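
For readers without the tree handy, the fallback pattern the patch borrows
from blk_mq_first_mapped_cpu() can be illustrated with a small userspace
sketch. The mask helpers below are hypothetical stand-ins for the kernel's
cpumask API (which works on struct cpumask, not plain integers), and the
example also shows why Ming's objection holds: the fallback can still hand
back an offline cpu.

/*
 * Userspace sketch (not kernel code) of the cpu-selection pattern the
 * patch adds: try the intersection of the hctx map and the online mask
 * first, and only if that yields no cpu (the nr_cpu_ids sentinel case)
 * fall back to the first cpu in the hctx map, online or not.
 */
#include <stdio.h>

#define NR_CPU_IDS 8	/* stand-in for the kernel's nr_cpu_ids */

/* Like cpumask_first(): first set bit, or NR_CPU_IDS if the mask is empty. */
static int mask_first(unsigned int mask)
{
	for (int cpu = 0; cpu < NR_CPU_IDS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return NR_CPU_IDS;
}

/* Like cpumask_first_and(): first bit set in both masks, or NR_CPU_IDS. */
static int mask_first_and(unsigned int a, unsigned int b)
{
	return mask_first(a & b);
}

static int pick_cpu(unsigned int hctx_mask, unsigned int online_mask)
{
	int cpu = mask_first_and(hctx_mask, online_mask);

	if (cpu >= NR_CPU_IDS)			/* no online cpu in the hctx map */
		cpu = mask_first(hctx_mask);	/* the patch's fallback */
	return cpu;
}

int main(void)
{
	/*
	 * hctx maps only cpu 3, while cpus 0-2 are the only online ones.
	 * The pre-patch code would have used NR_CPU_IDS as a per-cpu index
	 * and crashed; the fallback picks cpu 3 instead -- which is
	 * offline, the case Ming points out above.
	 */
	printf("picked cpu %d\n", pick_cpu(1u << 3, 0x07));
	return 0;
}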