Re: [PATCH v3 06/10] scsi: sd_zbc: emulate ZONE_APPEND commands

On 2020/03/28 17:51, Christoph Hellwig wrote:
>> Since zone reset and finish operations can be issued concurrently with
>> writes and zone append requests, ensure a coherent update of the zone
>> write pointer offsets by also write locking the target zones for these
>> zone management requests.
> 
> While they can be issued concurrently you can't expect sane behavior
> in that case.  So I'm not sure why we need the zone write lock in this
> case.

The behavior will certainly not be sane for a buggy application doing writes
and resets to the same zone concurrently (I have debugged that several times in
the field), so I am not worried about that at all. The zone write lock here is
still used to make sure the wp cache stays in sync with the drive. Without it,
concurrent completions could race when updating the wp and the cache would get
out of sync.
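To illustrate the race (a toy user-space model, not kernel code): without the
zone write lock serializing writes per zone, the cache just keeps the value of
whichever completion happens to be processed last, which need not match the
drive's actual wp:

```c
#include <stdint.h>

/* Toy model of the race the zone write lock prevents: two writes to the
 * same zone complete, and the cached wp takes the value of whichever
 * completion is processed last. With the zone write lock, only one write
 * per zone is in flight, so completions cannot be reordered like this. */
static uint32_t cached_wp;

static void complete_write(uint32_t end_ofst)
{
	cached_wp = end_ofst;	/* unserialized completion-side update */
}
```

Here, if the drive executed write A (wp -> 16) then write B (wp -> 24) but the
completions are processed B then A, the cache ends up holding 16 while the
drive wp is 24.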

> 
>> +++ b/drivers/scsi/sd.c
>> @@ -1215,6 +1215,12 @@ static blk_status_t sd_setup_read_write_cmnd(struct scsi_cmnd *cmd)
>>  	else
>>  		protect = 0;
>>  
>> +	if (req_op(rq) == REQ_OP_ZONE_APPEND) {
>> +		ret = sd_zbc_prepare_zone_append(cmd, &lba, nr_blocks);
>> +		if (ret)
>> +			return ret;
>> +	}
> 
> I'd move this up a few lines to keep all the PI related code together.
> 
>> +#define SD_ZBC_INVALID_WP_OFST	~(0u)
>> +#define SD_ZBC_UPDATING_WP_OFST	(SD_ZBC_INVALID_WP_OFST - 1)
> 
> Given that this goes into the seq_zones_wp_ofst shouldn't the block
> layer define these values?

We could, at least the first one. The second one is really something that could
be considered completely driver dependent, since other drivers doing this
emulation may handle the updating state differently.

Since this is the only driver that needs it, maybe we can keep it here for now?
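For reference, a self-contained user-space mirror of how the two sentinels
partition the 32-bit offset space (the macro names are from the patch; the
classifier function is purely illustrative):

```c
#include <stdint.h>

/* User-space mirror of the patch's sentinel encoding: the top two values
 * of the 32-bit offset space are reserved, everything below them is a
 * valid write pointer offset. */
#define SD_ZBC_INVALID_WP_OFST	(~0u)
#define SD_ZBC_UPDATING_WP_OFST	(SD_ZBC_INVALID_WP_OFST - 1)

enum wp_state { WP_VALID, WP_INVALID, WP_UPDATING };

/* Illustrative classifier mirroring the checks done before using the
 * cached wp for zone append emulation. */
static enum wp_state wp_classify(uint32_t wp_ofst)
{
	if (wp_ofst == SD_ZBC_UPDATING_WP_OFST)
		return WP_UPDATING;	/* update in flight: requeue */
	if (wp_ofst == SD_ZBC_INVALID_WP_OFST)
		return WP_INVALID;	/* trigger a report zones */
	return WP_VALID;		/* usable for append emulation */
}
```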

> 
>> +struct sd_zbc_zone_work {
>> +	struct work_struct work;
>> +	struct scsi_disk *sdkp;
>> +	unsigned int zno;
>> +	char buf[SD_BUF_SIZE];
>> +};
> 
> Wouldn't it make sense to have one work_struct per scsi device and batch
> updates?  That is, also query a decent sized buffer with a bunch of
> zones and update them all at once?  Also given that the other write
> pointer caching code is in the block layer, why is this in SCSI?

Again, because we thought this is driver dependent, in the sense that other
drivers may want to handle invalid WP entries differently. Also, I think that
one work struct per device may be overkill. This is for error recovery, and on
a normal healthy system, write errors are rare.

> 
>> +	spin_lock_bh(&sdkp->zone_wp_ofst_lock);
>> +
>> +	wp_ofst = rq->q->seq_zones_wp_ofst[zno];
>> +
>> +	if (wp_ofst == SD_ZBC_UPDATING_WP_OFST) {
>> +		/* Write pointer offset update in progress: ask for a requeue */
>> +		ret = BLK_STS_RESOURCE;
>> +		goto err;
>> +	}
>> +
>> +	if (wp_ofst == SD_ZBC_INVALID_WP_OFST) {
>> +		/* Invalid write pointer offset: trigger an update from disk */
>> +		ret = sd_zbc_update_wp_ofst(sdkp, zno);
>> +		goto err;
>> +	}
>> +
>> +	wp_ofst = sectors_to_logical(sdkp->device, wp_ofst);
>> +	if (wp_ofst + nr_blocks > sdkp->zone_blocks) {
>> +		ret = BLK_STS_IOERR;
>> +		goto err;
>> +	}
>> +
>> +	/* Set the LBA for the write command used to emulate zone append */
>> +	*lba += wp_ofst;
>> +
>> +	spin_unlock_bh(&sdkp->zone_wp_ofst_lock);
> 
> This seems like a really good use case for cmpxchg.  But I guess
> premature optimization is the root of all evil, so let's keep this in
> mind for later.

OK.
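For the record, the lockless path could look roughly like this (user-space C
with C11 atomics; wp_try_advance and its exact semantics are purely
illustrative, not what the patch does):

```c
#include <stdatomic.h>
#include <stdint.h>

#define SD_ZBC_INVALID_WP_OFST	(~0u)
#define SD_ZBC_UPDATING_WP_OFST	(SD_ZBC_INVALID_WP_OFST - 1)

/* Hypothetical lockless variant of the append setup: atomically advance
 * the cached wp by nr_blocks with a compare-and-swap loop, failing if
 * the slot holds one of the two sentinels so the caller can fall back
 * to requeue/update. All names here are illustrative. */
static int wp_try_advance(_Atomic uint32_t *wp, uint32_t nr_blocks,
			  uint32_t *lba_ofst)
{
	uint32_t old = atomic_load(wp);

	do {
		if (old >= SD_ZBC_UPDATING_WP_OFST)
			return -1;	/* invalid or updating: fall back */
	} while (!atomic_compare_exchange_weak(wp, &old, old + nr_blocks));

	*lba_ofst = old;	/* the append lands at the pre-advance wp */
	return 0;
}
```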

> 
>> +	/*
>> +	 * For zone append, the zone was locked in sd_zbc_prepare_zone_append().
>> +	 * For zone reset and zone finish, the zone was locked in
>> +	 * sd_zbc_setup_zone_mgmt_cmnd().
>> +	 * For regular writes, the zone is unlocked by the block layer elevator.
>> +	 */
>> +	return req_op(rq) == REQ_OP_ZONE_APPEND ||
>> +		req_op(rq) == REQ_OP_ZONE_RESET ||
>> +		req_op(rq) == REQ_OP_ZONE_FINISH;
>> +}
>> +
>> +static bool sd_zbc_need_zone_wp_update(struct request *rq)
>> +{
>> +	if (req_op(rq) == REQ_OP_WRITE ||
>> +	    req_op(rq) == REQ_OP_WRITE_ZEROES ||
>> +	    req_op(rq) == REQ_OP_WRITE_SAME)
>> +		return blk_rq_zone_is_seq(rq);
>> +
>> +	if (req_op(rq) == REQ_OP_ZONE_RESET_ALL)
>> +		return true;
>> +
>> +	return sd_zbc_zone_needs_write_unlock(rq);
> 
> To me all this would look cleaner with a switch statement:
> 
> static bool sd_zbc_need_zone_wp_update(struct request *rq)
> 
> 	switch (req_op(rq)) {
> 	case REQ_OP_ZONE_APPEND:
> 	case REQ_OP_ZONE_FINISH:
> 	case REQ_OP_ZONE_RESET:
> 	case REQ_OP_ZONE_RESET_ALL:
> 		return true;
> 	case REQ_OP_WRITE:
> 	case REQ_OP_WRITE_ZEROES:
> 	case REQ_OP_WRITE_SAME:
> 		return blk_rq_zone_is_seq(rq);
> 	default:
> 		return false;
> 	}
> }

Yes, it looks better this way.

> 
>> +	if (!sd_zbc_need_zone_wp_update(rq))
>> +		goto unlock_zone;
> 
> Split the wp update into a little helper?

Yes. And if we move the spinlock to the block layer as you suggest below, then
we can have that helper generic in blk-zoned.c too.
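Roughly this shape, I think (simplified user-space stand-ins for the kernel op
codes and fields, not actual patch code):

```c
#include <stdint.h>

enum req_op { OP_WRITE, OP_ZONE_APPEND, OP_ZONE_RESET, OP_ZONE_FINISH };

/* Sketch of the suggested helper: fold the per-op wp bookkeeping done at
 * completion time into one function returning the new cached wp offset.
 * Names and types are illustrative stand-ins for the kernel ones. */
static uint32_t wp_after_completion(enum req_op op, uint32_t wp,
				    uint32_t nr_blocks, uint32_t zone_blocks)
{
	switch (op) {
	case OP_ZONE_RESET:
		return 0;			/* wp back to zone start */
	case OP_ZONE_FINISH:
		return zone_blocks;		/* zone is now full */
	case OP_WRITE:
	case OP_ZONE_APPEND:
		if (wp + nr_blocks > zone_blocks)
			return zone_blocks;	/* clamp at zone end */
		return wp + nr_blocks;		/* advance by written blocks */
	}
	return wp;
}
```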

> 
>> +void sd_zbc_init_disk(struct scsi_disk *sdkp)
>> +{
>> +	if (!sd_is_zoned(sdkp))
>> +		return;
>> +
>> +	spin_lock_init(&sdkp->zone_wp_ofst_lock);
> 
> Shouldn't this lock also go into the block code where the cached
> write pointer lives?
> 


-- 
Damien Le Moal
Western Digital Research



