Re: [PATCH v2 1/2] block: prevent freeing a zone write plug too early

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 4/23/24 12:19 PM, Damien Le Moal wrote:
> On 2024/04/24 1:36, Jens Axboe wrote:
>> On 4/23/24 9:16 AM, Damien Le Moal wrote:
>>> On 2024/04/24 0:21, Jens Axboe wrote:
>>>> On 4/20/24 1:58 AM, Damien Le Moal wrote:
>>>>> The submission of plugged BIOs is done using a work struct executing the
>>>>> function blk_zone_wplug_bio_work(). This function gets and submits a
>>>>> plugged zone write BIO and is guaranteed to operate on a valid zone
>>>>> write plug (with a reference count higher than 0) on entry as plugged
>>>>> BIOs hold a reference on their zone write plugs. However, once a BIO is
>>>>> submitted with submit_bio_noacct_nocheck(), the BIO may complete before
>>>>> blk_zone_wplug_bio_work(), with the BIO completion trigering a release
>>>>> and freeing of the zone write plug if the BIO is the last write to a
>>>>> zone (making the zone FULL). This potentially can result in the zone
>>>>> write plug being freed while the work is still active.
>>>>>
>>>>> Avoid this by calling flush_work() from disk_free_zone_wplug_rcu().
>>>>>
>>>>> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
>>>>> Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>
>>>>> ---
>>>>>  block/blk-zoned.c | 2 ++
>>>>>  1 file changed, 2 insertions(+)
>>>>>
>>>>> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
>>>>> index 3befebe6b319..685f0b9159fd 100644
>>>>> --- a/block/blk-zoned.c
>>>>> +++ b/block/blk-zoned.c
>>>>> @@ -526,6 +526,8 @@ static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head)
>>>>>  	struct blk_zone_wplug *zwplug =
>>>>>  		container_of(rcu_head, struct blk_zone_wplug, rcu_head);
>>>>>  
>>>>> +	flush_work(&zwplug->bio_work);
>>>>> +
>>>>>  	mempool_free(zwplug, zwplug->disk->zone_wplugs_pool);
>>>>>  }
>>>>
>>>> This is totally backwards. First of all, if you actually had work that
>>>> needed flushing at this point, the kernel would bomb spectacularly.
>>>> Secondly, what's the point of using RCU to protect this, if you're now
>>>> needing to flush work from the RCU callback? That's a clear sign that
>>>> something is very wrong here with your references / RCU usage.. The work
>>>> item should hold a reference to it, trying to paper around it like this
>>>> is not going to work at all.
>>>>
>>>> Why is the work item racing with RCU freeing?!
>>>
>>> The work item is a field of the zone write plug. Zone write plugs have
>>> references to them as long as BIOs are in flight and and the zone is
>>> not full. The zone write plug freeing through rcu is triggered by the
>>> last write to a zone that makes the zone full. But the completion of
>>> this last write BIO may happen right after the work issued the BIO
>>> with submit_bio_noacct_nocheck() and before blk_zone_wplug_bio_work()
>>> returns, while the work item is still active.
>>>
>>> The actual freeing of the plug happens only after the rcu grace
>>> period, and I was not entirely sure if this is enough to guarantee
>>> that the work thread is finished. But checking how the workqueue code
>>> processes the work item by calling the work function
>>> (blk_zone_wplug_bio_work() in this case), there is no issue because
>>> the work item (struct work_struct) is not touched once the work
>>> function is called. So there are no issues/races with freeing the zone
>>> write plug. I was overthinking this. My bad. We can drop this patch.
>>> Apologies for the noise.
>>
>> I took a closer look at the zone write plug reference handling, and it
>> still doesn't look very good. Why are some just atomic_dec and there's
>> just one spot that does dec_and_test? This again looks like janky
>> referencing, to be honest.
>>
>> The relationship seems like it should be pretty clear. Any bio inflight
>> against this zone plug should have a reference to it, AND the owner
>> should have a reference to it, otherwise any bio completion (which can
>> happen at ANY time) could free it. Any dropping of the ref should use a
>> helper that does atomic_dec_and_test(), eg what disk_put_zone_wplug()
>> does.
>>
>> There should be no doubt about the above at all. If the plug has been
>> added to a workqueue, it should be quite obvious that of course it has a
>> reference to it already, outside of the bio's that are in it.
>>
>> I'd strongly encourage getting this sorted out before the merge window,
>> I'm not at all convinced it's correct as-is. It's certainly not
>> obviously correct, which it should be. The RCU rules are pretty simple
>> if the the references are done in the kernel idiomatic way, but they are
>> not.
> 
> OK. I am traveling this week so I will not be able to send a
> well-tested cleanup patch but I will do so first thing next week.

Sounds good, thanks.

-- 
Jens Axboe





[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux