On 2024/04/24 1:36, Jens Axboe wrote: > On 4/23/24 9:16 AM, Damien Le Moal wrote: >> On 2024/04/24 0:21, Jens Axboe wrote: >>> On 4/20/24 1:58 AM, Damien Le Moal wrote: >>>> The submission of plugged BIOs is done using a work struct executing the >>>> function blk_zone_wplug_bio_work(). This function gets and submits a >>>> plugged zone write BIO and is guaranteed to operate on a valid zone >>>> write plug (with a reference count higher than 0) on entry as plugged >>>> BIOs hold a reference on their zone write plugs. However, once a BIO is >>>> submitted with submit_bio_noacct_nocheck(), the BIO may complete before >>>> blk_zone_wplug_bio_work(), with the BIO completion trigering a release >>>> and freeing of the zone write plug if the BIO is the last write to a >>>> zone (making the zone FULL). This potentially can result in the zone >>>> write plug being freed while the work is still active. >>>> >>>> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu(). >>>> >>>> Fixes: dd291d77cc90 ("block: Introduce zone write plugging") >>>> Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx> >>>> --- >>>> block/blk-zoned.c | 2 ++ >>>> 1 file changed, 2 insertions(+) >>>> >>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c >>>> index 3befebe6b319..685f0b9159fd 100644 >>>> --- a/block/blk-zoned.c >>>> +++ b/block/blk-zoned.c >>>> @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head) >>>> struct blk_zone_wplug *zwplug = >>>> container_of(rcu_head, struct blk_zone_wplug, rcu_head); >>>> >>>> + flush_work(&zwplug->bio_work); >>>> + >>>> mempool_free(zwplug, zwplug->disk->zone_wplugs_pool); >>>> } >>> >>> This is totally backwards. First of all, if you actually had work that >>> needed flushing at this point, the kernel would bomb spectacularly. >>> Secondly, what's the point of using RCU to protect this, if you're now >>> needing to flush work from the RCU callback? That's a clear sign that >>> something is very wrong here with your references / RCU usage.. The work >>> item should hold a reference to it, trying to paper around it like this >>> is not going to work at all. >>> >>> Why is the work item racing with RCU freeing?! >> >> The work item is a field of the zone write plug. Zone write plugs have >> references to them as long as BIOs are in flight and and the zone is >> not full. The zone write plug freeing through rcu is triggered by the >> last write to a zone that makes the zone full. But the completion of >> this last write BIO may happen right after the work issued the BIO >> with submit_bio_noacct_nocheck() and before blk_zone_wplug_bio_work() >> returns, while the work item is still active. >> >> The actual freeing of the plug happens only after the rcu grace >> period, and I was not entirely sure if this is enough to guarantee >> that the work thread is finished. But checking how the workqueue code >> processes the work item by calling the work function >> (blk_zone_wplug_bio_work() in this case), there is no issue because >> the work item (struct work_struct) is not touched once the work >> function is called. So there are no issues/races with freeing the zone >> write plug. I was overthinking this. My bad. We can drop this patch. >> Apologies for the noise. > > I took a closer look at the zone write plug reference handling, and it > still doesn't look very good. Why are some just atomic_dec and there's > just one spot that does dec_and_test? This again looks like janky > referencing, to be honest. > > The relationship seems like it should be pretty clear. Any bio inflight > against this zone plug should have a reference to it, AND the owner > should have a reference to it, otherwise any bio completion (which can > happen at ANY time) could free it. Any dropping of the ref should use a > helper that does atomic_dec_and_test(), eg what disk_put_zone_wplug() > does. > > There should be no doubt about the above at all. If the plug has been > added to a workqueue, it should be quite obvious that of course it has a > reference to it already, outside of the bio's that are in it. > > I'd strongly encourage getting this sorted out before the merge window, > I'm not at all convinced it's correct as-is. It's certainly not > obviously correct, which it should be. The RCU rules are pretty simple > if the the references are done in the kernel idiomatic way, but they are > not. OK. I am traveling this week so I will not be able to send a well-tested cleanup patch but I will do so first thing next week. -- Damien Le Moal Western Digital Research