Re: untangle the request_queue refcounting from the queue kobject v2

Al Viro <viro@xxxxxxxxxxxxxxxxxx> · Sat, 19 Nov 2022 02:19:43 +0000

On Mon, Nov 14, 2022 at 05:26:32AM +0100, Christoph Hellwig wrote:
> Hi Jens,
> 
> this series cleans up the registration of the "queue/" kobject, and given
> untangles it from the request_queue refcounting.
> 
> Changes since v1:
>  - also change the blk_crypto_sysfs_unregister prototype
>  - add two patches to fix the error handling in blk_register_queue

Umm...  Do we ever want access to queue parameters of the stuff that has
a queue, but no associated gendisk?  SCSI tape, for example...

	Re refcounting: AFAICS, blk_mq_alloc_disk_for_queue() is broken.
__alloc_disk_node() consumes queue reference (and stuffs it into gendisk->queue)
on success; on failure it leaves the reference alone.  E.g. this

struct gendisk *__blk_mq_alloc_disk(struct blk_mq_tag_set *set, void *queuedata,
		struct lock_class_key *lkclass)
{
	struct request_queue *q;
	struct gendisk *disk;

	q = blk_mq_init_queue_data(set, queuedata);
	if (IS_ERR(q))
		return ERR_CAST(q);

// we hold the initial reference to q
	disk = __alloc_disk_node(q, set->numa_node, lkclass);
	if (!disk) {
// __alloc_disk_node() failed, we are still holding q
		blk_mq_destroy_queue(q);
		blk_put_queue(q);
// reference dropped
		return ERR_PTR(-ENOMEM);
	}
	set_bit(GD_OWNS_QUEUE, &disk->state);
	return disk;
// ... and on success, the reference is consumed by disk, which is returned to caller
}

is fine; however, this
struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
		struct lock_class_key *lkclass)
{
	if (!blk_get_queue(q))
		return NULL;
	return __alloc_disk_node(q, NUMA_NO_NODE, lkclass);
}
can't be right - we might fail in blk_get_queue(), returning NULL with
unchanged refcount, we might succeed and return the new gendisk that
has consumed the extra reference grabbed by blk_get_queue() *OR*
we might grab an extra reference, fail in __alloc_disk_node() and
return NULL with refcount on q bumped.  No way for caller to tell these
failure modes from each other...  The callers (both sd and sr) treat
both as "no reference grabbed", i.e. leak the queue refcount if they
fail past grabbing the queue.

Looks like we should drop the queue if __alloc_disk_node() fails.  As in

struct gendisk *blk_mq_alloc_disk_for_queue(struct request_queue *q,
		struct lock_class_key *lkclass)
{
	struct gendisk *disk;

	if (!blk_get_queue(q))
		return NULL;
	disk = __alloc_disk_node(q, NUMA_NO_NODE, lkclass);
	if (!disk)
		blk_put_queue(q);
	return disk;
}

Objections?