On 2/5/24 21:20, Damien Le Moal wrote: > On 2/5/24 19:06, Ming Lei wrote: >> On Mon, Feb 05, 2024 at 11:41:04AM +0900, Damien Le Moal wrote: >>> On 2/5/24 11:19, Ming Lei wrote: >>>>>>> +static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug, >>>>>>> + struct bio *bio, unsigned int nr_segs) >>>>>>> +{ >>>>>>> + /* >>>>>>> + * Keep a reference on the BIO request queue usage. This reference will >>>>>>> + * be dropped either if the BIO is failed or after it is issued and >>>>>>> + * completes. >>>>>>> + */ >>>>>>> + percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter); >>>>>> >>>>>> It is fragile to get nested usage_counter, and same with grabbing/releasing it >>>>>> from different contexts or even functions, and it could be much better to just >>>>>> let block layer maintain it. >>>>>> >>>>>> From patch 23's change: >>>>>> >>>>>> + * Zoned block device information. Reads of this information must be >>>>>> + * protected with blk_queue_enter() / blk_queue_exit(). Modifying this >>>>>> >>>>>> Anytime if there is in-flight bio, the block device is opened, so both gendisk and >>>>>> request_queue are live, so not sure if this .q_usage_counter protection >>>>>> is needed. >>>>> >>>>> Hannes also commented about this. Let me revisit this. >>>> >>>> I think only queue re-configuration(blk_revalidate_zone) requires the >>>> queue usage counter. Otherwise, bdev open()/close() should work just >>>> fine. >>> >>> I want to check FS case though. No clear if mounting FS that supports zone >>> (btrfs) also uses bdev open ? >> >> I feel the following delta change might be cleaner and easily documented: >> >> - one IO takes single reference for both bio based and blk-mq, >> - no drop & re-grab >> - only grab extra reference for bio based >> - two code paths share same pattern >> >> diff --git a/block/blk-core.c b/block/blk-core.c >> index 9520ccab3050..118dd789beb5 100644 >> --- a/block/blk-core.c >> +++ b/block/blk-core.c >> @@ -597,6 +597,10 @@ static void __submit_bio(struct bio *bio) >> >> if (!bio->bi_bdev->bd_has_submit_bio) { >> blk_mq_submit_bio(bio); >> + } else if (bio_zone_write_plugging(bio)) { >> + struct gendisk *disk = bio->bi_bdev->bd_disk; >> + >> + disk->fops->submit_bio(bio); Actually, no, that is not correct. This would not stop BIO submission if blk_queue_freeze() was called by another context. So we cannot do that here without calling blk_queue_enter()... >> } else if (likely(bio_queue_enter(bio) == 0)) { >> struct gendisk *disk = bio->bi_bdev->bd_disk; >> >> diff --git a/block/blk-mq.c b/block/blk-mq.c >> index f0fc61a3ec81..fc6d792747dc 100644 >> --- a/block/blk-mq.c >> +++ b/block/blk-mq.c >> @@ -3006,8 +3006,12 @@ void blk_mq_submit_bio(struct bio *bio) >> if (blk_mq_attempt_bio_merge(q, bio, nr_segs)) >> goto queue_exit; >> >> + /* >> + * Grab one reference for plugged zoned write and it will be reused in >> + * next real submission >> + */ >> if (blk_queue_is_zoned(q) && blk_zone_write_plug_bio(bio, nr_segs)) >> - goto queue_exit; >> + return; ...and this one is not correct because of the cached request: if there was a cached request, blk_mq_submit_bio() did not call blk_queue_enter() because the cached request already had a reference. But we cannot reuse that reference as the next BIO may be a read or a write to a zone that is not plugged, and these would use the cached request and so need the usage counter reference. So we would still need to grab an extra reference in such case. So in the end, it feels a lot simpler to keep the reference counting as it was as it makes things a lot less messier in blk_mq_submit_bio(). I will though try to improve the comments to make it clear how this is working. >> >> if (!rq) { >> new_request: >> diff --git a/block/blk-zoned.c b/block/blk-zoned.c >> index f6d4f511b664..87abb3f7ef30 100644 >> --- a/block/blk-zoned.c >> +++ b/block/blk-zoned.c >> @@ -514,7 +514,8 @@ static inline void blk_zone_wplug_add_bio(struct blk_zone_wplug *zwplug, >> * be dropped either if the BIO is failed or after it is issued and >> * completes. >> */ >> - percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter); >> + if (bio->bi_bdev->bd_has_submit_bio) >> + percpu_ref_get(&bio->bi_bdev->bd_disk->queue->q_usage_counter); >> >> /* >> * The BIO is being plugged and thus will have to wait for the on-going >> @@ -760,15 +761,10 @@ static void blk_zone_wplug_bio_work(struct work_struct *work) >> >> blk_zone_wplug_unlock(zwplug, flags); >> >> - /* >> - * blk-mq devices will reuse the reference on the request queue usage >> - * we took when the BIO was plugged, but the submission path for >> - * BIO-based devices will not do that. So drop this reference here. >> - */ >> + submit_bio_noacct_nocheck(bio); >> + >> if (bio->bi_bdev->bd_has_submit_bio) >> blk_queue_exit(bio->bi_bdev->bd_disk->queue); > > Hmm... As-is, this is a potential use-after-free of the bio. But I get the idea. > This is indeed a little better. I will integrate this. > >> - >> - submit_bio_noacct_nocheck(bio); >> } >> >> static struct blk_zone_wplug *blk_zone_alloc_write_plugs(unsigned int nr_zones) >> >> Thanks, >> Ming >> >> > -- Damien Le Moal Western Digital Research