>On 2021/01/13 13:01, Changheun Lee wrote:
>>> On 2021/01/12 21:14, Changheun Lee wrote:
>>>>> On 2021/01/12 17:52, Changheun Lee wrote:
>>>>>> From: "Changheun Lee" <nanich.lee@xxxxxxxxxxx>
>>>>>>
>>>>>> bio size can grow up to 4GB when muli-page bvec is enabled.
>>>>>> but sometimes it would lead to inefficient behaviors.
>>>>>> in case of large chunk direct I/O, - 64MB chunk read in user space -
>>>>>> all pages for 64MB would be merged to a bio structure if memory address is
>>>>>> continued phsycally. it makes some delay to submit until merge complete.
>>>>>> bio max size should be limited as a proper size.
>>>>>
>>>>> But merging physically contiguous pages into the same bvec + later automatic bio
>>>>> split on submit should give you better throughput for large IOs compared to
>>>>> having to issue a bio chain of smaller BIOs that are arbitrarily sized and will
>>>>> likely need splitting anyway (because of DMA boundaries etc).
>>>>>
>>>>> Do you have a specific case where you see higher performance with this patch
>>>>> applied ? On Intel, BIO_MAX_SIZE would be 1MB... That is arbitrary and too small
>>>>> considering that many hardware can execute larger IOs than that.
>>>>>
>>>>
>>>> When I tested 32MB chunk read with O_DIRECT in android, all pages of 32MB
>>>> is merged into a bio structure.
>>>> And elapsed time to merge complete was about 2ms.
>>>> It means first bio-submit is after 2ms.
>>>> If bio size is limited with 1MB with this patch, first bio-submit is about
>>>> 100us by bio_full operation.
>>>
>>> bio_submit() will split the large BIO case into multiple requests while the
>>> small BIO case will likely result one or two requests only. That likely explain
>>> the time difference here. However, for the large case, the 2ms will issue ALL
>>> requests needed for processing the entire 32MB user IO while the 1MB bio case
>>> will need 32 different bio_submit() calls. So what is the actual total latency
>>> difference for the entire 32MB user IO ? That is I think what needs to be
>>> compared here.
>>>
>>> Also, what is your device max_sectors_kb and max queue depth ?
>>>
>>
>> 32MB total latency is about 19ms including merge time without this patch.
>> But with this patch, total latency is about 17ms including merge time too.
>> Actually 32MB read time from device is same - about 16.7ms - in driver layer.
>> No need to hold more I/O than max_sectors_kb during bio merge.
>> My device is UFS. and max_sectors_kb is 1MB, queue depth is 32.
>
>This may be due to the CPU being slow: it takes time to build the 32MB BIO,
>during which the device is idling. With the 1MB BIO limit, the first BIO hits
>the drive much more quickly, reducing idle time of the device and leading to
>higher throughput. But there are 31 more BIOs to build and issue after the first
>one... That does create a pipeline of requests keeping the device busy, but that
>also likely keeps your CPU a lot more busy building these additional BIOs.
>Overall, do you see a difference in CPU load for the 32MB BIO case vs the 1MB
>max BIO case ? Any increase in CPU load with the lower BIO size limit ?
>

CPU load is higher than in the 32MB bio case, because bio submission runs in
parallel with the merging. But I tested at the same CPU operating frequency,
so there is no difference in CPU speed between the two cases.
Logically, the bio max size is 4GB now. I know that is not realistic, but
merging 4GB into a single bio would take a long time even with a fast CPU.
This is why I think the bio max size should be limited.
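
To make the timing argument concrete, below is a rough illustrative sketch of
how a direct-I/O style submitter builds bios page by page. It is not the real
__blkdev_direct_IO code, and the submit_pages_as_bios() helper and its
arguments are made up for this example. The point is that the first
submit_bio() only happens once bio_full() reports the current bio cannot take
more data, so a smaller per-bio size cap lets the device see its first request
sooner:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical helper, for illustration only. */
static void submit_pages_as_bios(struct block_device *bdev, sector_t sector,
				 struct page **pages, unsigned int nr_pages)
{
	struct bio *bio = NULL;
	unsigned int i;

	for (i = 0; i < nr_pages; i++) {
		if (!bio) {
			bio = bio_alloc(GFP_KERNEL, BIO_MAX_PAGES);
			bio_set_dev(bio, bdev);
			bio->bi_iter.bi_sector = sector;
			bio->bi_opf = REQ_OP_READ;
		}

		/*
		 * Merge the page into the current bio. When the bio is full
		 * (bi_vcnt or bi_size limit reached), submit it and retry
		 * this page in a fresh bio. A smaller size limit makes this
		 * first submit_bio() happen after fewer merged pages.
		 */
		if (bio_full(bio, PAGE_SIZE) ||
		    bio_add_page(bio, pages[i], PAGE_SIZE, 0) != PAGE_SIZE) {
			sector += bio_sectors(bio);
			submit_bio(bio);
			bio = NULL;
			i--;		/* retry this page in the next bio */
		}
	}

	if (bio)
		submit_bio(bio);
}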
>>
>>>> It's not large delay and can't be observed with low speed device.
>>>> But it's needed to reduce merge delay for high speed device.
>>>> I improved 512MB sequential read performance from 1900MB/s to 2000MB/s
>>>> with this patch on android platform.
>>>> As you said, 1MB might be small for some device.
>>>> But method is needed to re-size, or select the bio max size.
>>>
>>> At the very least, I think that such limit should not be arbitrary as your patch
>>> proposes but rely on the device characteristics (e.g.
>>> max_hw_sectors_kb/max_sectors_kb and queue depth).
>>>
>>
>> I agree with your opinion, I thought same as your idea. For that, deep research
>> is needed, proper timing to set and bio structure modification, etc ...
>
>Why would you need any BIO structure modifications ? Your patch is on the right
>track if limiting the BIO size is the right solution (I am not still completely
>convinced). E.g., the code:
>
>	if (page_is_mergeable(bv, page, len, off, same_page)) {
>		if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>			*same_page = false;
>			return false;
>		}
>
>could just become:
>
>	if (page_is_mergeable(bv, page, len, off, same_page)) {
>		if (bio->bi_iter.bi_size > bio_max_size(bio) - len) {
>			*same_page = false;
>			return false;
>		}
>
>With bio_max_size() being something like:
>
>static inline size_t bio_max_size(struct bio *bio)
>{
>	sector_t max_sectors = blk_queue_get_max_sectors(bio->bi_disk->queue,
>							 bio_op(bio));
>
>	return max_sectors << SECTOR_SHIFT;
>}
>
>Note that this is not super efficient as a BIO maximum size depends on the BIO
>offset too (its start sector). So writing something similar to
>blk_rq_get_max_sectors() would probably be better.

Good suggestion. :)

>
>> Current is simple patch for default bio max size.
>> Before applying of multipage bvec, bio max size was 1MB in kernel 4.x by BIO_MAX_PAGES.
>> So I think 1MB bio max size is reasonable as a default.
>
>max_sectors_kb is always defined for any block device so I do not think there is
>a need for any arbitrary default value.
>
>Since such optimization likely very much depend on the speed of the system CPU
>and of the storage device used, it may be a good idea to have this configurable
>through sysfs. That is, bio_max_size() simply returns UINT_MAX leading to no
>change from the current behavior if the optimization is disabled (default) and
>max_sectors_kb if it is enabled.
>

OK, I agree with you. That will be the best approach for everyone for now.
I'll try to implement it.
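
A rough sketch of what the sysfs-controlled variant could look like, modeled
on your bio_max_size() suggestion and the existing handlers in
block/blk-sysfs.c, is below. The QUEUE_FLAG_LIMIT_BIO_SIZE flag and the
"limit_bio_size" attribute name are only placeholders for illustration, not
final names, and this is not the final patch:

#include <linux/bio.h>
#include <linux/blkdev.h>

static inline unsigned int bio_max_size(struct bio *bio)
{
	struct request_queue *q = bio->bi_disk->queue;

	/* Optimization disabled (default): behave exactly as today. */
	if (!test_bit(QUEUE_FLAG_LIMIT_BIO_SIZE, &q->queue_flags))
		return UINT_MAX;

	/* Optimization enabled: cap the bio at the queue's max_sectors. */
	return queue_max_sectors(q) << SECTOR_SHIFT;
}

/* sysfs store handler: echo 1 > /sys/block/<dev>/queue/limit_bio_size */
static ssize_t queue_limit_bio_size_store(struct request_queue *q,
					  const char *page, size_t count)
{
	unsigned long val;
	ssize_t ret = queue_var_store(&val, page, count);

	if (ret < 0)
		return ret;

	if (val)
		blk_queue_flag_set(QUEUE_FLAG_LIMIT_BIO_SIZE, q);
	else
		blk_queue_flag_clear(QUEUE_FLAG_LIMIT_BIO_SIZE, q);

	return ret;
}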
>>
>>>>
>>>>>
>>>>>>
>>>>>> Signed-off-by: Changheun Lee <nanich.lee@xxxxxxxxxxx>
>>>>>> ---
>>>>>>  block/bio.c         | 2 +-
>>>>>>  include/linux/bio.h | 3 ++-
>>>>>>  2 files changed, 3 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/block/bio.c b/block/bio.c
>>>>>> index 1f2cc1fbe283..dbe14d675f28 100644
>>>>>> --- a/block/bio.c
>>>>>> +++ b/block/bio.c
>>>>>> @@ -877,7 +877,7 @@ bool __bio_try_merge_page(struct bio *bio, struct page *page,
>>>>>>  		struct bio_vec *bv = &bio->bi_io_vec[bio->bi_vcnt - 1];
>>>>>>
>>>>>>  		if (page_is_mergeable(bv, page, len, off, same_page)) {
>>>>>> -			if (bio->bi_iter.bi_size > UINT_MAX - len) {
>>>>>> +			if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len) {
>>>>>>  				*same_page = false;
>>>>>>  				return false;
>>>>>>  			}
>>>>>> diff --git a/include/linux/bio.h b/include/linux/bio.h
>>>>>> index 1edda614f7ce..0f49b354b1f6 100644
>>>>>> --- a/include/linux/bio.h
>>>>>> +++ b/include/linux/bio.h
>>>>>> @@ -20,6 +20,7 @@
>>>>>>  #endif
>>>>>>
>>>>>>  #define BIO_MAX_PAGES		256
>>>>>> +#define BIO_MAX_SIZE		(BIO_MAX_PAGES * PAGE_SIZE)
>>>>>>
>>>>>>  #define bio_prio(bio)			(bio)->bi_ioprio
>>>>>>  #define bio_set_prio(bio, prio)		((bio)->bi_ioprio = prio)
>>>>>> @@ -113,7 +114,7 @@ static inline bool bio_full(struct bio *bio, unsigned len)
>>>>>>  	if (bio->bi_vcnt >= bio->bi_max_vecs)
>>>>>>  		return true;
>>>>>>
>>>>>> -	if (bio->bi_iter.bi_size > UINT_MAX - len)
>>>>>> +	if (bio->bi_iter.bi_size > BIO_MAX_SIZE - len)
>>>>>>  		return true;
>>>>>>
>>>>>>  	return false;
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Damien Le Moal
>>>>> Western Digital Research
>>>>
>>>
>>>
>>> --
>>> Damien Le Moal
>>> Western Digital Research
>>>
>>
>> ---
>> Changheun Lee
>> Samsung Electronics
>>
>>
>
>
>--
>Damien Le Moal
>Western Digital Research
>

---
Changheun Lee
Samsung Electronics