Re: [PATCH v2 5/6] dm: fix dm_blk_report_zones

Damien Le Moal <dlemoal@xxxxxxxxxx> · Wed, 19 Mar 2025 10:40:38 +0900

On 3/19/25 7:10 AM, Benjamin Marzinski wrote:
> On Tue, Mar 18, 2025 at 07:57:21AM +0100, Christoph Hellwig wrote:
>> On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
>>> __bind(). Otherwise the zone resources could change while accessing
>>> them. Finally, it is possible that md->zone_revalidate_map will change
>>> while calling this function. Only read it once, so that we are always
>>> using the same value. Otherwise we might miss a call to
>>> dm_put_live_table().
>>
>> This checking for calling context is pretty ugly.  Can you just use
>> a separate report zones helper or at least a clearly documented flag for it?
> 
> Not sure how that would work. The goal is to keep another process,
> calling something like blkdev_report_zones_ioctl(), from being able to
> call its report_zones_cb, while DM is in blk_revalidate_disk_zones()
> which needs to use that same disk->fops->report_zones() interface in
> order to call blk_revalidate_zone_cb(). We need some way to distinguish
> between the callers. We could export blk_revalidate_zone_cb() and have
> dm_blk_report_zones() check if it was called with that report_zones_cb.
> That seems just as ugly to me. But if you like that better, fine.
> Otherwise I don't see how we can distinguish between the call from
> blk_revalidate_disk_zones() and a call from something like
> blkdev_report_zones_ioctl(). Am I not understanding your suggestion?
> 
> Allowing the blkdev_report_zones_ioctl() call to go ahead using
> md->zone_revalidate_map runs into three problems.
> 
> 1. It's running while the device is switching tables, and there's no
> guarantee that DM won't encounter an error and have to fail back. So it
> could report information for this unused table. We could just say that
> this is what you get from trying to grab the zone information of a
> device while it's switching tables. Fine.
> 
> 2. Without this patch, it's possible that while
> blkdev_report_zones_ioctl() is still running its report_zones_cb, DM
> fails switching tables and frees the new table that's being used by
> blkdev_report_zones_ioctl(), causing use-after-free errors. However,
> this is solvable by calling srcu_read_lock(&md->io_barrier) to make sure
> that we're in a SRCU read section while using md->zone_revalidate_map.
> Making that chage should make DM as safe as any other zoned device,
> which brings me to...
> 
> 3. On any zoned device, not just DM, what's stopping
> one process from syncing the zone write plug offsets:
> blkdev_report_zones_ioctl() -> blkdev_report_zones() ->
> various.report_zones() -> disk_report_zones_cb() ->
> disk_zone_wplug_sync_wp_offset()

disk_zone_wplug_sync_wp_offset() is a nop for any zone write plug that does not
have BLK_ZONE_WPLUG_NEED_WP_UPDATE set. And that flag is set only for zones
that had a write error. Given that recovering from write errors on zoned device
requires to:
1) stop writing to the zone,
2) Do a report zone (which will sync the wp offset)
3) Restart writing
there is no write IOs going on for the zone that is being reported, for a well
behaved user. If the user is not well behaved, it will get errors/out of sync
wp, until the user behaves :) So I would not worry too much about this.
As we discussed, the table switching should be allowed only for the wildcard
target (dm-error) and that's it. We should not allow table switching otherwise.
And given that inserting dm-error does not cause any change to the zone
configuration, I do not see why we need to revalidate zones.

> 
> While another process is running into problems while dealing with the
> gendisk's zone configuration changing:
> 
> blk_revalidate_disk_zones() -> disk_free_zone_resources()

We should not allow the zone configuration to change. That is impossible to
deal with at the user level.

> I don't see any synchronizing between these two call paths. Am I missing
> something that stops this? Can this only happen for DM devices, for some
> reason? If this can't happen, I'm fine with just adding a
> srcu_read_lock() to dm_blk_report_zones() and calling it good.  If it
> can happen, then we should fix that.

I think it can happen today but no-one ever had this issue because no-one has
tried to switch a dm-table on a zoned drive. And I really think we should
prevent that from being done, except for dm-error since that's used for fstests.

> 
> -Ben
> 

-- 
Damien Le Moal
Western Digital Research