Re: [PATCH] bio: limit bio max size.

Damien Le Moal <Damien.LeMoal@xxxxxxx> · Wed, 13 Jan 2021 05:53:21 +0000

On 2021/01/13 13:01, Changheun Lee wrote:
>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>> From: "Changheun Lee" <nanich.lee@xxxxxxxxxxx>
>>>>>
>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>>>> but sometimes it would lead to inefficient behaviors.
>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>>>>> all pages for 64MB would be merged to a bio structure if memory address is
>>>>> continued phsycally. it makes some delay to submit until merge complete.
>>>>> bio max size should be limited as a proper size.
>>>>
>>>> But merging physically contiguous pages into the same bvec + later automatic bio
>>>> split on submit should give you better throughput for large IOs compared to
>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>
>>>> Do you have a specific case where you see higher performance with this patch
>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>>>> considering that many hardware can execute larger IOs than that.
>>>>
>>>
>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>>> is merged into a bio structure.
>>> And elapsed time to merge complete was about 2ms.
>>> It means first bio-submit is after 2ms.
>>> If bio size is limited with 1MB with this patch, first bio-submit is about
>>> 100us by bio_full operation.
>>
>> bio_submit() will split the large BIO case into multiple requests while the
>> small BIO case will likely result one or two requests only. That likely explain
>> the time difference here. However, for the large case, the 2ms will issue ALL
>> requests needed for processing the entire 32MB user IO while the 1MB bio case
>> will need 32 different bio_submit() calls. So what is the actual total latency
>> difference for the entire 32MB user IO ? That is I think what needs to be
>> compared here.
>>
>> Also, what is your device max_sectors_kb and max queue depth ?
>>
> 
> 32MB total latency is about 19ms including merge time without this patch.
> But with this patch, total latency is about 17ms including merge time too.
> Actually 32MB read time from device is same - about 16.7ms - in driver layer.
> No need to hold more I/O than max_sectors_kb during bio merge.
> My device is UFS. and max_sectors_kb is 1MB, queue depth is 32.

This may be due to the CPU being slow: it takes time to build the 32MB BIO,
during which the device is idling. With the 1MB BIO limit, the first BIO hits
the drive much more quickly, reducing idle time of the device and leading to
higher throughput. But there are 31 more BIOs to build and issue after the first
one... That does create a pipeline of requests keeping the device busy, but that
also likely keeps your CPU a lot more busy building these additional BIOs.
Overall, do you see a difference in CPU load for the 32MB BIO case vs the 1MB
max BIO case ? Any increase in CPU load with the lower BIO size limit ?

> 
>>> It's not large delay and can't be observed with low speed device.
>>> But it's needed to reduce merge delay for high speed device.
>>> I improved 512MB sequential read performance from 1900MB/s to 2000MB/s
>>> with this patch on android platform.
>>> As you said, 1MB might be small for some device.
>>> But method is needed to re-size, or select the bio max size.
>>
>> At the very least, I think that such limit should not be arbitrary as your patch
>> proposes but rely on the device characteristics (e.g.
>> max_hw_sectors_kb/max_sectors_kb and queue depth).
>>
> 
> I agree with your opinion, I thought same as your idea. For that, deep research
> is needed, proper timing to set and bio structure modification, etc ...

Why would you need any BIO structure modifications ? Your patch is on the right
track if limiting the BIO size is the right solution (I am not still completely
convinced). E.g., the code:

	if (page_is_mergeable(bv, page, len, off, same_page)) {
		if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
			*same_page = false;
			return false;
	}

could just become:

	if (page_is_mergeable(bv, page, len, off, same_page)) {
		if (bio->bi_iter.bi_size > bio_max_size(bio) - len) {
			*same_page = false;
			return false;
	}

With bio_max_size() being something like:

static inline size_t bio_max_size(struct bio *bio)
{
	sector_t max_sectors = blk_queue_get_max_sectors(bio->bi_disk->queue,
						         bio_op(bio));

	return max_sectors << SECTOR_SHIFT;
}

Note that this is not super efficient as a BIO maximum size depends on the BIO
offset too (its start sector). So writing something similar to
blk_rq_get_max_sectors() would probably be better.

> Current is simple patch for default bio max size.
> Before applying of multipage bvec, bio max size was 1MB in kernel 4.x by BIO_MAX_PAGES.
> So I think 1MB bio max size is reasonable as a default.

max_sectors_kb is always defined for any block device so I do not think there is
a need for any arbitrary default value.

Since such optimization likely very much depend on the speed of the system CPU
and of the storage device used, it may be a good idea to have this configurable
through sysfs. That is, bio_max_size() simply returns UINT_MAX leading to no
change from the current behavior if the optimization is disabled (default) and
max_sectors_kb if it is enabled.

> 
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Changheun Lee <nanich.lee@xxxxxxxxxxx>
>>>>> ---
>>>>>  block/bio.c         | 2 +-
>>>>>  include/linux/bio.h | 3 ++-
>>>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>> index 1f2cc1fbe283..dbe14d675f28 100644
>>>>> --- a/block/bio.c
>>>>> +++ b/block/bio.c
>>>>> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>>>>>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>>>>  
>>>>>  		if (page_is_mergeable(bv, page, len, off, same_page)) {
>>>>> -			if (bio->bi_iter.bi_size > UINT_MAX - len) {
>>>>> +			if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>>>>>  				*same_page = false;
>>>>>  				return false;
>>>>>  			}
>>>>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>>>>> index 1edda614f7ce..0f49b354b1f6 100644
>>>>> --- a/include/linux/bio.h
>>>>> +++ b/include/linux/bio.h
>>>>> @@ -20,6 +20,7 @@
>>>>>  #endif
>>>>>  
>>>>>  #define BIO_MAX_PAGES		256
>>>>> +#define BIO_MAX_SIZE		(BIO_MAX_PAGES * PAGE_SIZE)
>>>>>  
>>>>>  #define bio_prio(bio)			(bio)->bi_ioprio
>>>>>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
>>>>> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
>>>>>  	if (bio->bi_vcnt >= bio->bi_max_vecs)
>>>>>  		return true;
>>>>>  
>>>>> -	if (bio->bi_iter.bi_size > UINT_MAX - len)
>>>>> +	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>>>>>  		return true;
>>>>>  
>>>>>  	return false;
>>>>>
>>>>
>>>>
>>>> -- 
>>>> Damien Le Moal
>>>> Western Digital Research
>>>
>>
>>
>> -- 
>> Damien Le Moal
>> Western Digital Research
>>
> 
> ---
> Changheun Lee
> Samsung Electronics
> 
> 

-- 
Damien Le Moal
Western Digital Research