Re: [PATCH 08/26] block: Implement zone append emulation

Damien Le Moal <dlemoal@xxxxxxxxxx> · Mon, 5 Feb 2024 09:10:14 +0900

On 2/4/24 21:24, Hannes Reinecke wrote:
> On 2/2/24 15:30, Damien Le Moal wrote:
>> +/*
>> + * Set a zone write plug write pointer offset to either 0 (zone reset case)
>> + * or to the zone size (zone finish case). This aborts all plugged BIOs, which
>> + * is fine to do as doing a zone reset or zone finish while writes are in-flight
>> + * is a mistake from the user which will most likely cause all plugged BIOs to
>> + * fail anyway.
>> + */
>> +static void blk_zone_wplug_set_wp_offset(struct gendisk *disk,
>> +					 struct blk_zone_wplug *zwplug,
>> +					 unsigned int wp_offset)
>> +{
>> +	/*
>> +	 * Updating the write pointer offset puts back the zone
>> +	 * in a good state. So clear the error flag and decrement the
>> +	 * error count if we were in error state.
>> +	 */
>> +	if (zwplug->flags & BLK_ZONE_WPLUG_ERROR) {
>> +		zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
>> +		atomic_dec(&disk->zone_nr_wplugs_with_error);
>> +	}
>> +
>> +	/* Update the zone write pointer and abort all plugged BIOs. */
>> +	zwplug->wp_offset = wp_offset;
>> +	blk_zone_wplug_abort(disk, zwplug);
>> +}

[...]

>> +static void blk_zone_wplug_handle_error(struct gendisk *disk,
>> +					struct blk_zone_wplug *zwplug)
>> +{
>> +	unsigned int zno = zwplug - disk->zone_wplugs;
>> +	sector_t zone_start_sector = bdev_zone_sectors(disk->part0) * zno;
>> +	unsigned int noio_flag;
>> +	struct blk_zone zone;
>> +	unsigned long flags;
>> +	int ret;
>> +
>> +	/* Check if we have an error and clear it if we do. */
>> +	blk_zone_wplug_lock(zwplug, flags);
>> +	if (!(zwplug->flags & BLK_ZONE_WPLUG_ERROR))
>> +		goto unlock;
>> +	zwplug->flags &= ~BLK_ZONE_WPLUG_ERROR;
>> +	atomic_dec(&disk->zone_nr_wplugs_with_error);
>> +	blk_zone_wplug_unlock(zwplug, flags);
>> +
> 
> Don't you need to quiesce the drive here?
> After all, I/O (or a reset zone) might be executed after the call to 
> report zones, but before we lock the zone, no?

Indeed, this is racy with reset zone. But there is no race with IOs because when
the error flag is set, we always plug incoming BIOs.
But the race with reset (and finish) is actually easy to fix. All I need to do
is not clear the error flag above and check for it after the report zones and
locking the zone plug. Given that blk_zone_wplug_set_wp_offset() clears the
error flag, we end up restoring a known good wp either from the reset or from
the report zones.

>>   struct blk_revalidate_zone_args {
>> @@ -890,6 +1292,8 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
>>   			if (!args->seq_zones_wlock)
>>   				return -ENOMEM;
>>   		}
>> +		args->zone_wplugs[idx].capacity = zone->capacity;
>> +		args->zone_wplugs[idx].wp_offset = blk_zone_wp_offset(zone);
>>   		break;
>>   	case BLK_ZONE_TYPE_SEQWRITE_PREF:
>>   	default:
>> @@ -964,6 +1368,13 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>>   	if (!args.zone_wplugs)
>>   		goto out_restore_noio;
>>   
>> +	if (!disk->zone_wplugs) {
>> +		mutex_init(&disk->zone_wplugs_mutex);
>> +		atomic_set(&disk->zone_nr_wplugs_with_error, 0);
>> +		INIT_DELAYED_WORK(&disk->zone_wplugs_work,
>> +				  disk_zone_wplugs_work);
>> +	}
>> +
> 
> Same question here about device quiesce ...

Yes, I need to check this, together with revisiting the queue usage counter
handling.

>>   	ret = disk->fops->report_zones(disk, 0, UINT_MAX,
>>   				       blk_revalidate_zone_cb, &args);
>>   	if (!ret) {
>> @@ -989,12 +1400,14 @@ int blk_revalidate_disk_zones(struct gendisk *disk,
>>   	 */
>>   	blk_mq_freeze_queue(q);
> 
> And this, I guess, comes to late.
> We've already read the zone list, so any write I/O submitted after the 
> report zone but befor here will cause things to be iffy.

Yes, but the driver is supposed to guarantee that this function is being called
while there are no writes in flight. DM is OK with that. I think scsi is too,
but need to check again. Not sure about NVMe and null_blk. But Ming had a good
point about the usage ref coming from the device being open. So a quiesce may be
enough here. Need to revisit this.

-- 
Damien Le Moal
Western Digital Research