Re: [RFC PATCH v4 1/3] bcache: export bcache zone information for zoned backing device

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2020/6/2 16:48, Damien Le Moal wrote:
> On Mon, 2020-06-01 at 20:34 +0800, Coly Li wrote:
>> [...]
>>>> +
>>>> +	/* convert back to LBA of the bcache device*/
>>>> +	zone->start -= data_offset;
>>>> +	zone->wp -= data_offset;
>>>
>>> This has to be done depending on the zone type and zone condition: zone->wp is
>>> "invalid" for conventional zones, and sequential zones that are full, read-only
>>> or offline. So you need something like this:
>>>
>>> 	/* Remap LBA to the bcache device */
>>> 	zone->start -= data_offset;
>>> 	switch(zone->cond) {
>>> 	case BLK_ZONE_COND_NOT_WP:
>>> 	case BLK_ZONE_COND_READONLY:
>>> 	case BLK_ZONE_COND_FULL:
>>> 	case BLK_ZONE_COND_OFFLINE:
>>> 		break;
>>> 	case BLK_ZONE_COND_EMPTY:
>>> 		zone->wp = zone->start;
>>> 		break;
>>> 	default:
>>> 		zone->wp -= data_offset;
>>> 		break;
>>> 	}
>>>
>>> 	return args->orig_cb(zone, idx, args->orig_data);
>>>
>>
>> Hmm, do we have a unified spec to handle the wp on different zone
>> condition ?
> 
> This comes from SCSI ZBC and ATA ZAC specifications, version r05 of the
> documents for both (ratified specifications). These 2 specifications
> define identical semantic and states for SMR zones. Upcoming NVMe ZNS
> has similar and compatible zone management too.
> 
>> In zonefs I see zone->wp sets to zone->start for,
>> - BLK_ZONE_COND_OFFLINE
>> - BLK_ZONE_COND_READONLY
> 
> This is because zonefs uses a zone write pointer position for the file
> size: file size = zone->wp - zone->start;
> For the read-only and offline zone conditions, since the wp given by
> the drive is defined as "invalid", so unknown, zonefs changes the wp to
> zone start to make the file size 0 and prevent any read IO issuing.
> 
>>
>> In sd_zbc.c, I see wp sets to end of the zone for
>> - BLK_ZONE_COND_FULL
> 
> This is only to be nice to host software users, so that the wp position
> relative to the zone start is exactly the zone size, indicating that
> the entire zone was written. This allows to have equivalence between
> the tests
> 
> zone->cond == BLK_ZONE_COND_FULL
> 
> and
> 
> zone->wp == zone->start + zone->len
> 
> to make coding easier. But the fact remain that the value given for the
> wp of a full zone by the disk is undefined. Generally, drives return
> 0xfffff...ffff.
> 

Copied. Thanks for the above information.


>> And in dm.c I see zone->wp is set to zone->start for,
>> - BLK_ZONE_COND_EMPTY
> 
> That is because like bcache, dm may remap zones to indicate sector
> values that make sense for the target device. The values given by the
> physical underlying disk cannot be used as is. The code snippet above I
> sent does the same for the same reason.
> 
>> It seems except for BLK_ZONE_COND_NOT_WP, for other conditions the
>> writer pointer should be taken cared by device driver already.
> 
> The physical drive device driver gives you what the disk indicated,
> modulo the change for full zone. Bcache *is* a device driver too,
> driver of the target device. The report zones method for that device
> needs to take care of the zone remapping if needed. That is not for the
> underlying physical device driver lower in the stack to do this as that
> layer does not know how the drive is being used.
> 


Copied, thanks for the information :-)

>> So I write such switch-case in the following way by the inspair of your
>> comments,
>>
>>         /* Zone 0 should not be reported */
>>         if(WARN_ON_ONCE(zone->start < data_offset))
>>                 return -EIO;
> 
> That depends on the start sector specified when blkdev_report_zones()
> is called. Beware to have the start sector being >= zone size. That is,
> that report start sector needs remapping too.
> 

Because zones before data_offset should not visible by upper layer code,
such zones should not be sent to cached_dev_report_zones_cb(). This is a
double check that caller of cached_dev_report_zones_cb() does things in
correct way.


>>         /* convert back to LBA of the bcache device*/
>>         zone->start -= data_offset;
>>         if (zone->cond != BLK_ZONE_COND_NOT_WP)
>>                 zone->wp -= data_offset;
> 
> Since you have the BLK_ZONE_COND_NOT_WP condition in the switch-case,
> this is not needed. 
> 
>>
>>         switch (zone->cond) {
>>         case BLK_ZONE_COND_NOT_WP:
>>                 /* No write pointer available */
>>                 break;
>>         case BLK_ZONE_COND_EMPTY:
> 
> zone->wp = zone->start;
> 

Correct me if I am wrong. I assume when the zone is in empty condition,
LBA of zone->wp should be exactly equal to zone->start before bcache
does the conversion. Therefore 'zone->wp - data_offset' should still be
equal to 'zone->start - data_offset'. Therefore explicitly handle it in
'case BLK_ZONE_COND_EMPTY:' should be equal to handle it in 'default:' part.

Just want to double check I understand the wp correctly, thanks.


> needs to ba added here so that your wp is remapped.
> 
>>         case BLK_ZONE_COND_READONLY:
>>         case BLK_ZONE_COND_OFFLINE:
>>                 /*
>>                  * If underlying device driver does not properly
>>                  * set writer pointer, warn and fix it.
>>                  */
>>                 if (WARN_ON_ONCE(zone->wp != zone->start))
>>                         zone->wp = zone->start;
> 
> This is not needed. The wp value is undefined for these conditions.
> touching it, looking at it or even thinking of it does not make sense
> :) So leave the wp as is for these 2 cases.

Copied.

> 
>>                 break;
>>         case BLK_ZONE_COND_FULL:
>>                 /*
>>                  * If underlying device driver does not properly
>>                  * set writer pointer, warn and fix it.
>>                  */
>>                 if (WARN_ON_ONCE(zone->wp != zone->start + zone->len))
>>                         zone->wp = zone->start + zone->len;
> 
> Simply unconditionally set the wp to zone->start + zone->len. No WARN
> ON needed at all. I actually forgot this case in the code snippet I
> sent.

Copied.


> 
>>                 break;
>>         default:
>>                 break;
>>         }
>>
>>         return args->orig_cb(zone, idx, args->orig_data);
>>
>> The above code converts zone->wp by minus data_offset for
>> non-BLK_ZONE_COND_NOT_WP condition. And for other necessary conditions,
>> I just check whether the underlying device driver updates write pointer
>> properly (as I observed in other drivers), if not then drop a warning
>> and fix the write pointer to the expected value.
> 
> The wp essentially comes from the drive. Except for a full zone, the
> value is passed on as is by the driver. The physical device driver does
> not "set" the wp.
> 
>> Now the write pointer is only fixed when it was wrong value. If the
>> underlying device driver does not maintain the value properly, we figure
>> out and fix it.
> 
> The wp will be a wrong value only if the drive you are using has a
> firmware bug. The "fixing" needed is because bcache device needs
> remapping (off-by-one-zone) of the physical device zone
> information. That is for bcache to do. And that remapping needs to be
> done carefully since some wp values are undefined. Your code above as
> is will for instance change the wp value from "undefined"
> (0xffffffffffffffff for drives out there, generally, but it could be
> anything) to "undefined - data_offset" for any readonly or offline
> zone. The result is still "undefined"... Doing that kind of operation
> on the wp is not necessary at all and does not serve any useful
> purpose.

I see. Now it is more clear to me. Thanks.

Coly Li




[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux