> On 14 Feb 2025, at 05.14, Damien Le Moal <dlemoal@xxxxxxxxxx> wrote:
> 
> For devices that natively support zone append operations,
> REQ_OP_ZONE_APPEND BIOs are not processed through zone write plugging
> and are immediately issued to the zoned device. This means that there is
> no write pointer offset tracking done for these operations and that a
> zone write plug is not necessary.
> 
> However, when receiving a zone append BIO, we may already have a zone
> write plug for the target zone if that zone was previously partially
> written using regular write operations. In such a case, since the write
> pointer offset of the zone write plug is not incremented by the number
> of sectors appended to the zone, two issues arise:
> 1) We risk leaving the plug in the disk hash table if the zone is fully
>    written using zone append or regular write operations, because the
>    write pointer offset will never reach the "zone full" state.
> 2) Regular write operations that are issued after zone append operations
>    will always be failed by blk_zone_wplug_prepare_bio(), as the write
>    pointer alignment check will fail, even if the user correctly
>    accounted for the zone append operations and issued the regular
>    writes with a correct sector.
> 
> Avoid these issues by immediately removing the zone write plug of zones
> that are the target of zone append operations when blk_zone_plug_bio()
> is called. The new function blk_zone_wplug_handle_native_zone_append()
> implements this for devices that natively support zone append. The
> removal of the zone write plug using disk_remove_zone_wplug() requires
> aborting all plugged regular writes using disk_zone_wplug_abort(), as
> otherwise the plugged write BIOs would never be executed (with the plug
> removed, the completion path will never again see the zone write plug,
> as disk_get_zone_wplug() will return NULL). Rate-limited warnings are
> added to blk_zone_wplug_handle_native_zone_append() and to
> disk_zone_wplug_abort() to signal this.
> 
> Since blk_zone_wplug_handle_native_zone_append() is called in the hot
> path for operations that will not be plugged, disk_get_zone_wplug() is
> optimized under the assumption that a user issuing zone append
> operations is not at the same time issuing regular writes and that there
> are no hashed zone write plugs. The new atomic counter nr_zone_wplugs in
> struct gendisk is used to check this; it is incremented in
> disk_insert_zone_wplug() and decremented in disk_remove_zone_wplug().
> 
> To be consistent with this fix, we do not need to fill the zone write
> plug hash table with zone write plugs for partially written zones of a
> device that supports native zone append operations. So modify
> blk_revalidate_seq_zone() to return early, avoiding the allocation and
> insertion of a zone write plug for partially written sequential zones,
> if the device natively supports zone append.
> 
> Reported-by: Jorgen Hansen <Jorgen.Hansen@xxxxxxx>
> Fixes: 9b1ce7f0c6f8 ("block: Implement zone append emulation")
> Cc: stable@xxxxxxxxxxxxxxx
> Signed-off-by: Damien Le Moal <dlemoal@xxxxxxxxxx>
> ---
>  block/blk-zoned.c      | 76 ++++++++++++++++++++++++++++++++++++++----
>  include/linux/blkdev.h |  7 ++--
>  2 files changed, 73 insertions(+), 10 deletions(-)
> 
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 761ea662ddc3..0c77244a35c9 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -410,13 +410,14 @@ static bool disk_insert_zone_wplug(struct gendisk *disk,
>  		}
>  	}
>  	hlist_add_head_rcu(&zwplug->node, &disk->zone_wplugs_hash[idx]);
> +	atomic_inc(&disk->nr_zone_wplugs);
>  	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>  
>  	return true;
>  }
>  
> -static struct blk_zone_wplug *disk_get_zone_wplug(struct gendisk *disk,
> -						  sector_t sector)
> +static struct blk_zone_wplug *disk_get_hashed_zone_wplug(struct gendisk *disk,
> +							 sector_t sector)
>  {
>  	unsigned int zno = disk_zone_no(disk, sector);
>  	unsigned int idx = hash_32(zno, disk->zone_wplugs_hash_bits);
> @@ -437,6 +438,15 @@ static struct blk_zone_wplug *disk_get_zone_wplug(struct gendisk *disk,
>  	return NULL;
>  }
>  
> +static inline struct blk_zone_wplug *disk_get_zone_wplug(struct gendisk *disk,
> +							 sector_t sector)
> +{
> +	if (!atomic_read(&disk->nr_zone_wplugs))
> +		return NULL;
> +
> +	return disk_get_hashed_zone_wplug(disk, sector);
> +}
> +
>  static void disk_free_zone_wplug_rcu(struct rcu_head *rcu_head)
>  {
>  	struct blk_zone_wplug *zwplug =
> @@ -503,6 +513,7 @@ static void disk_remove_zone_wplug(struct gendisk *disk,
>  	zwplug->flags |= BLK_ZONE_WPLUG_UNHASHED;
>  	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>  	hlist_del_init_rcu(&zwplug->node);
> +	atomic_dec(&disk->nr_zone_wplugs);
>  	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>  	disk_put_zone_wplug(zwplug);
>  }
> @@ -593,6 +604,11 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
>  {
>  	struct bio *bio;
>  
> +	if (bio_list_empty(&zwplug->bio_list))
> +		return;
> +
> +	pr_warn_ratelimited("%s: zone %u: Aborting plugged BIOs\n",
> +			    zwplug->disk->disk_name, zwplug->zone_no);
>  	while ((bio = bio_list_pop(&zwplug->bio_list)))
>  		blk_zone_wplug_bio_io_error(zwplug, bio);
>  }
> @@ -1040,6 +1056,47 @@ static bool blk_zone_wplug_handle_write(struct bio *bio, unsigned int nr_segs)
>  	return true;
>  }
>  
> +static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
> +{
> +	struct gendisk *disk = bio->bi_bdev->bd_disk;
> +	struct blk_zone_wplug *zwplug;
> +	unsigned long flags;
> +
> +	/*
> +	 * We have native support for zone append operations, so we are not
> +	 * going to handle @bio through plugging. However, we may already have a
> +	 * zone write plug for the target zone if that zone was previously
> +	 * partially written using regular writes. In such a case, we risk
> +	 * leaving the plug in the disk hash table if the zone is fully written
> +	 * using zone append operations. Avoid this by removing the zone write
> +	 * plug.
> +	 */
> +	zwplug = disk_get_zone_wplug(disk, bio->bi_iter.bi_sector);
> +	if (likely(!zwplug))
> +		return;
> +
> +	spin_lock_irqsave(&zwplug->lock, flags);
> +
> +	/*
> +	 * We are about to remove the zone write plug. But if the user
> +	 * (mistakenly) has issued regular writes together with native zone
> +	 * append, we must abort the writes as otherwise the plugged BIOs would
> +	 * not be executed by the plug BIO work, as disk_get_zone_wplug() will
> +	 * return NULL after the plug is removed. Aborting the plugged write
> +	 * BIOs is consistent with the fact that these writes will most likely
> +	 * fail anyway, as there are no ordering guarantees between zone append
> +	 * operations and regular write operations.
> +	 */
> +	if (!bio_list_empty(&zwplug->bio_list)) {
> +		pr_warn_ratelimited("%s: zone %u: Invalid mix of zone append and regular writes\n",
> +				    disk->disk_name, zwplug->zone_no);
> +		disk_zone_wplug_abort(zwplug);
> +	}
> +	disk_remove_zone_wplug(disk, zwplug);
> +	spin_unlock_irqrestore(&zwplug->lock, flags);
> +
> +	disk_put_zone_wplug(zwplug);
> +}
> +
>  /**
>   * blk_zone_plug_bio - Handle a zone write BIO with zone write plugging
>   * @bio: The BIO being submitted
> @@ -1096,8 +1153,10 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
>  	 */
>  	switch (bio_op(bio)) {
>  	case REQ_OP_ZONE_APPEND:
> -		if (!bdev_emulates_zone_append(bdev))
> +		if (!bdev_emulates_zone_append(bdev)) {
> +			blk_zone_wplug_handle_native_zone_append(bio);
>  			return false;
> +		}
>  		fallthrough;
>  	case REQ_OP_WRITE:
>  	case REQ_OP_WRITE_ZEROES:
> @@ -1284,6 +1343,7 @@ static int disk_alloc_zone_resources(struct gendisk *disk,
>  {
>  	unsigned int i;
>  
> +	atomic_set(&disk->nr_zone_wplugs, 0);
>  	disk->zone_wplugs_hash_bits =
>  		min(ilog2(pool_size) + 1, BLK_ZONE_WPLUG_MAX_HASH_BITS);
>  
> @@ -1338,6 +1398,7 @@ static void disk_destroy_zone_wplugs_hash_table(struct gendisk *disk)
>  		}
>  	}
>  
> +	WARN_ON_ONCE(atomic_read(&disk->nr_zone_wplugs));
>  	kfree(disk->zone_wplugs_hash);
>  	disk->zone_wplugs_hash = NULL;
>  	disk->zone_wplugs_hash_bits = 0;
> @@ -1550,11 +1611,12 @@ static int blk_revalidate_seq_zone(struct blk_zone *zone, unsigned int idx,
>  	}
>  
>  	/*
> -	 * We need to track the write pointer of all zones that are not
> -	 * empty nor full. So make sure we have a zone write plug for
> -	 * such zone if the device has a zone write plug hash table.
> +	 * If the device needs zone append emulation, we need to track the
> +	 * write pointer of all zones that are not empty nor full. So make sure
> +	 * we have a zone write plug for such zone if the device has a zone
> +	 * write plug hash table.
>  	 */
> -	if (!disk->zone_wplugs_hash)
> +	if (!queue_emulates_zone_append(disk->queue) || !disk->zone_wplugs_hash)
>  		return 0;
>  
>  	disk_zone_wplug_sync_wp_offset(disk, zone);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 248416ecd01c..39cc5862cc48 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -196,10 +196,11 @@ struct gendisk {
>  	unsigned int		zone_capacity;
>  	unsigned int		last_zone_capacity;
>  	unsigned long __rcu	*conv_zones_bitmap;
> -	unsigned int		zone_wplugs_hash_bits;
> -	spinlock_t		zone_wplugs_lock;
> +	unsigned int		zone_wplugs_hash_bits;
> +	atomic_t		nr_zone_wplugs;
> +	spinlock_t		zone_wplugs_lock;
>  	struct mempool_s	*zone_wplugs_pool;
> -	struct hlist_head	*zone_wplugs_hash;
> +	struct hlist_head	*zone_wplugs_hash;
>  	struct workqueue_struct *zone_wplugs_wq;
>  #endif /* CONFIG_BLK_DEV_ZONED */
> 
> -- 
> 2.48.1

Looks good to me. I tested that the write patterns that failed previously
now work, and also verified that the zone write plug abort path on mixed
zone append/regular write workloads works as expected.

Tested-by: Jorgen Hansen <Jorgen.Hansen@xxxxxxx>