On 4/23/24 12:19 PM, Damien Le Moal wrote: > On 2024/04/24 1:36, Jens Axboe wrote: >> On 4/23/24 9:16 AM, Damien Le Moal wrote: >>> On 2024/04/24 0:21, Jens Axboe wrote: >>>> On 4/20/24 1:58 AM, Damien Le Moal wrote: >>>>> The submission of plugged BIOs is done using a work struct executing the >>>>> function blk_zone_wplug_bio_work(). This function gets and submits a >>>>> plugged zone write BIO and is guaranteed to operate on a valid zone >>>>> write plug (with a reference count higher than 0) on entry as plugged >>>>> BIOs hold a reference on their zone write plugs. However, once a BIO is >>>>> submitted with submit_bio_noacct_nocheck(), the BIO may complete before >>>>> blk_zone_wplug_bio_work(), with the BIO completion trigering a release >>>>> and freeing of the zone write plug if the BIO is the last write to a >>>>> zone (making the zone FULL). This potentially can result in the zone >>>>> write plug being freed while the work is still active. >>>>> >>>>> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu(). >>>>> >>>>> Fixes: dd291d77cc90 ("block: Introduce zone write plugging") >>>>> Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx> >>>>> --- >>>>> block/blk-zoned.c | 2 ++ >>>>> 1 file changed, 2 insertions(+) >>>>> >>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c >>>>> index 3befebe6b319..685f0b9159fd 100644 >>>>> --- a/block/blk-zoned.c >>>>> +++ b/block/blk-zoned.c >>>>> @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head) >>>>> struct blk_zone_wplug *zwplug = >>>>> container_of(rcu_head, struct blk_zone_wplug, rcu_head); >>>>> >>>>> + flush_work(&zwplug->bio_work); >>>>> + >>>>> mempool_free(zwplug, zwplug->disk->zone_wplugs_pool); >>>>> } >>>> >>>> This is totally backwards. First of all, if you actually had work that >>>> needed flushing at this point, the kernel would bomb spectacularly. >>>> Secondly, what's the point of using RCU to protect this, if you're now >>>> needing to flush work from the RCU callback? That's a clear sign that >>>> something is very wrong here with your references / RCU usage.. The work >>>> item should hold a reference to it, trying to paper around it like this >>>> is not going to work at all. >>>> >>>> Why is the work item racing with RCU freeing?! >>> >>> The work item is a field of the zone write plug. Zone write plugs have >>> references to them as long as BIOs are in flight and and the zone is >>> not full. The zone write plug freeing through rcu is triggered by the >>> last write to a zone that makes the zone full. But the completion of >>> this last write BIO may happen right after the work issued the BIO >>> with submit_bio_noacct_nocheck() and before blk_zone_wplug_bio_work() >>> returns, while the work item is still active. >>> >>> The actual freeing of the plug happens only after the rcu grace >>> period, and I was not entirely sure if this is enough to guarantee >>> that the work thread is finished. But checking how the workqueue code >>> processes the work item by calling the work function >>> (blk_zone_wplug_bio_work() in this case), there is no issue because >>> the work item (struct work_struct) is not touched once the work >>> function is called. So there are no issues/races with freeing the zone >>> write plug. I was overthinking this. My bad. We can drop this patch. >>> Apologies for the noise. >> >> I took a closer look at the zone write plug reference handling, and it >> still doesn't look very good. Why are some just atomic_dec and there's >> just one spot that does dec_and_test? This again looks like janky >> referencing, to be honest. >> >> The relationship seems like it should be pretty clear. Any bio inflight >> against this zone plug should have a reference to it, AND the owner >> should have a reference to it, otherwise any bio completion (which can >> happen at ANY time) could free it. Any dropping of the ref should use a >> helper that does atomic_dec_and_test(), eg what disk_put_zone_wplug() >> does. >> >> There should be no doubt about the above at all. If the plug has been >> added to a workqueue, it should be quite obvious that of course it has a >> reference to it already, outside of the bio's that are in it. >> >> I'd strongly encourage getting this sorted out before the merge window, >> I'm not at all convinced it's correct as-is. It's certainly not >> obviously correct, which it should be. The RCU rules are pretty simple >> if the the references are done in the kernel idiomatic way, but they are >> not. > > OK. I am traveling this week so I will not be able to send a > well-tested cleanup patch but I will do so first thing next week. Sounds good, thanks. -- Jens Axboe