Re: block: wrong return value by bio_end_sector?

Paolo Valente <paolo.valente@xxxxxxxxxx> · Tue, 4 Oct 2022 10:46:27 +0200

> Il giorno 3 ott 2022, alle ore 00:33, Damien Le Moal <Damien.LeMoal@xxxxxxx> ha scritto:
> 
> On 2022/10/03 1:20, Paolo Valente wrote:
>> 
>> 
>>> Il giorno 1 ott 2022, alle ore 02:50, Damien Le Moal <damien.lemoal@xxxxxxxxxxxxxxxxxx> ha scritto:
>>> 
>>> On 10/1/22 00:59, Paolo Valente wrote:
>>>> Hi Jens, Damien, all other possibly interested people, this is to raise
>>>> attention on a mistake that has emerged in a thread on a bfq extension
>>>> for multi-actuary drives [1].
>>>> 
>>>> The mistake is apparently in the macro bio_end_sector (defined in 
>>>> include/linux/bio.h), which seems to be translated (incorrectly) as 
>>>> sector+size, and not as sector+size-1.
>>> 
>>> This has been like this for a long time, I think.
>>> 
>>>> 
>>>> For your convenience, I'm pasting a detailed description of the 
>>>> problem, by Tyler (description taken from the above thread [1]).
>>>> 
>>>> The drive reports the actuator ranges as a starting LBA and a count of
>>>> LBAs for the range. If the code reading the reported values simply does
>>>> startingLBA + range, this is an incorrect ending LBA for that actuator.
>>> 
>>> Well, yes. LBA 0 + drive capacity is also an incorrect LBA. If the code
>>> assumes that it is, you have a classic off-by-one bug.
>>> 
>>>> This is because LBAs are zero indexed and this simple addition is not
>>>> taking that into account. The proper way to get the endingLBA is
>>>> startingLBA + range - 1 to get the last LBA value for where to issue a
>>>> final IO read/write to account for LBA values starting at zero rather
>>>> than one.
>>> 
>>> Yes. And ? Where is the issue ?
>>> 
>>>> 
>>>> Here is an example from the output in SeaChest/openSeaChest: 
>>>> ====Concurrent Positioning Ranges====
>>>> 
>>>> Range#     #Elements            Lowest LBA          # of LBAs 0
>>>> 1                                               0
>>>> 17578328064 1            1                         17578328064
>>>> 17578328064
>>>> 
>>>> If using the incorrect formula to get the final LBA for actuator 0, you
>>>> would get 17578328064, but this is the starting LBA reported by the
>>>> drive for actuator 1. So to be consistent for all ranges, the final LBA
>>>> for a given actuator should be calculated as starting LBA + range - 1.
>>>> 
>>>> I had reached out to Seagate's T10 and T13 representatives for
>>>> clarification and verification and this is most likely what is causing
>>>> the error is a missing - 1 somewhere after getting the information
>>>> reported by the device. They agreed that the reporting from the drive
>>>> and the SCSI to ATA translation is correct.
>>>> 
>>>> I'm not sure where this is being read and calculated, but it is not an
>>>> error in the low-level libata or sd level of the kernel. It may be in
>>>> bfq, or it may be in some other place after the sd layer. I know there
>>>> were some additions to read this and report it up the stack, but I did
>>>> not think those were wrong as they seemed to pass the drive reported
>>>> information up the stack.
>>>> 
>>>> Jens, Damien, can you shed a light on this?
>>> 
>>> I am not clear on what the problem is exactly. This all sound like a
>>> simple off-by-one issue if bfq support code. No ?
>> 
>> Yes, it's (also) in bfq code now; no, it's not only in bfq code. 
>> 
>> The involved bfq code is a replica of the following original function
>> (living in block/blk-iaranges.c):
>> 
>> static struct blk_independent_access_range *
>> disk_find_ia_range(struct blk_independent_access_ranges *iars,
>> 		  sector_t sector)
>> {
>> 	struct blk_independent_access_range *iar;
>> 	int i;
>> 
>> 	for (i = 0; i < iars->nr_ia_ranges; i++) {
>> 		iar = &iars->ia_range[i];
>> 		if (sector >= iar->sector &&
>> 		    sector < iar->sector + iar->nr_sectors)
>> 			return iar;
>> 	}
>> 
>> 	return NULL;
>> }
>> 
>> bfq's replica simply contains also this warning, right before the return NULL:
>> 
>> 	WARN_ONCE(true,
>> 		  "bfq_actuator_index: bio sector out of ranges: end=%llu\n",
>> 		  end);
>> 
>> So, if this warning is triggered, then the sector does not belong to
>> any range.
>> 
>> That warning gets actually triggered [1], for a sector number that is
>> larger than the largest extreme (iar->sector + iar->nr_sectors) in the
>> iar data structure.  The offending value resulted to be simply equal
>> to the largest possible value accepted by the iar data structure, plus
>> one.
>> 
>> So, yes, this is an off-by-one error.  More precisely, there is a
>> mismatch between the above code (or the values stored the iar data
>> structure) and the value of bio_end_sector (the latter is passed as
>> input to the above function).  bio_end_sector does not seem to give
>> the end sector of a request (equal to begin+size-1), as apparently
>> expected by the above code, but the sector after the last one (namely,
>> begin+size).
>> 
>> Should I express an opinion on where the error is, I would say that
>> the mistake is in bio_end_sector.  But I could be totally wrong, as
>> I'm not a expert of either of the involved parts.  In addition,
>> bio_end_sector is already in use, with its current formula, by other
>> parts of the block layer.  If you guys (Damien, Jens, Tyler?, ...)
>> give us some guidance on what to fix exactly, we will be happy to make
>> a fix.
> 
> I do not think there is any error/problem anywhere with bio_end_sector(). But
> indeed, the name is slightly misleading as it doe not return the last sector
> processed by the bio but the next one. The reason is that with this
> implementation, bio_end_sector() always gives you the first sector for the next
> bio with a multi-bio request.
> 
> When you need the last sector processed by the bio, you need to use
> bio_end_sector(bio) - 1.
> 
> So to find the access range that served a completed bio, you simply need to call:
> 
> disk_find_ia_range(iars, bio_end_sector(bio) - 1);
> 

Thank you very much, I got it now.

> You probably could add a couple of helper functions for getting an access range
> from a bio sector or end sector. Something like:
> 
> static inline struct blk_independent_access_range *
> __bio_sector_access_range(struct bio *bio, sector_t sector)
> {
> 	struct gendisk *disk = bio->bi_bdev->bd_disk;
> 
> 	if (!disk->ia_ranges)
> 		return NULL;
> 
> 	iar = disk_find_ia_range(disk->ia_ranges, sector);
> 	WARN_ONCE(!iar,
> 		  "bfq_actuator_index: bio sector out of ranges: end=%llu\n",
> 		  end);
> 
> 	return iar;
> }
> 
> static inline struct blk_independent_access_range *
> bio_sector_access_range(struct bio *bio)
> {
> 	return __bio_sector_access_range(bio, bio_sector(bio));
> }
> 
> static inline struct blk_independent_access_range *
> bio_end_sector_access_range(struct bio *bio)
> {
> 	return __bio_sector_access_range(bio, bio_end_sector(bio) - 1);
> }
> 
> or similar. Then your code will be simpler and much less worries about of-by-one
> bugs.
> 

For the moment, bio_end_sector is invoked in only one place (for the
range-computation stuff).  So, for simplicity, I'd go for just
appending a -1 after such invocation.  To give you credits, I'm adding
you in a reviewed-by tag.  If anything of this is wrong, I'm willing
to change it.  I'll send a V2 soon.

Thanks,
Paolo

>> 
>> Thanks,
>> Paolo
>> 
>> 
>> 
>>> 
>>>> 
>>>> Thanks, Paolo
>>>> 
>>>> [1] https://www.spinics.net/lists/kernel/msg4507408.html
>>> 
>>> -- 
>>> Damien Le Moal
>>> Western Digital Research
>> 
>> 
> 
> -- 
> Damien Le Moal
> Western Digital Research