Hi all,

I am wondering why the block layer generates requests that are limited to the readahead size when doing buffered reads to an NVMeoF target SSD. I am not sure whether it is okay to ask this kind of question on this mailing list; I apologize if such questions are not appropriate here.

To describe the environment I experimented with: I used a Samsung 970 NVMe SSD for storage and a Mellanox ConnectX-4 InfiniBand adapter for the network. My server runs kernel version 4.20. I issued buffered I/O with the C read() API and traced it with blktrace. The NVMeoF SSD was formatted as ext4.

C read() API test (rough sketches of the programs I used are at the end of this mail):
- The initiator sends requests to the target only as large as the readahead size (default: 128KB). After I changed the readahead size through sysfs, the request size changed accordingly.
- With direct I/O, the request size matched the size of the buffer (a char array) allocated in my user-level program.

FIO test:
- For buffered I/O, the request size followed the block size; I think this is because I set the block size to 4K.
- Likewise, for direct I/O, the requests were issued according to the block size.

To sum up, buffered I/O with the C read() API over NVMeoF completed more requests than the same I/O issued locally to the NVMe SSD. From what I have measured, I think buffered I/O over NVMeoF performs worse than local I/O because the requests are split at the readahead size. I tried to analyze the blk-mq and nvme code, but these layers are too broad and difficult for me to understand.

Why is the request size capped at the readahead size when buffered I/O is issued to the NVMeoF target with read()? I would like to know the reason and which code builds the block requests. I have my blktrace logs; if you want to see them, I will attach them.

Thank you.
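
The sketches I referred to above follow; the file paths, buffer sizes, and device names in them are illustrative rather than my exact code.

1) Buffered read() test. The requests generated by this loop are what blktrace showed capped at the readahead size:

/* Buffered read through the page cache on the ext4-formatted
 * NVMeoF namespace. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BUF_SIZE (1024 * 1024)  /* 1 MiB user buffer; the size is arbitrary */

static char buf[BUF_SIZE];

int main(void)
{
        long long total = 0;
        ssize_t n;
        int fd = open("/mnt/nvmeof/testfile", O_RDONLY);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* On the initiator, the requests issued for these reads were at
         * most the readahead size (128KB by default), regardless of
         * BUF_SIZE. */
        while ((n = read(fd, buf, BUF_SIZE)) > 0)
                total += n;

        if (n < 0)
                perror("read");

        printf("read %lld bytes\n", total);
        close(fd);
        return 0;
}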
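
2) Changing the readahead window, which changed the observed request size, was just a write to the queue's read_ahead_kb attribute in sysfs; in C it amounts to something like the following (nvme0n1 and the 512KB value are only examples; the default is 128):

/* Set the readahead window of the NVMeoF namespace on the initiator. */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/block/nvme0n1/queue/read_ahead_kb", "w");

        if (!f) {
                perror("fopen");
                return 1;
        }

        fprintf(f, "512\n");    /* readahead window in KB */
        return fclose(f) ? 1 : 0;
}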
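
3) The direct I/O variant differed only in the open flags and buffer alignment; here the request size followed the user buffer size rather than the readahead window:

/* Direct I/O: O_DIRECT with an aligned buffer, bypassing the page
 * cache and therefore readahead. */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUF_SIZE (1024 * 1024)  /* a multiple of the logical block size */

int main(void)
{
        long long total = 0;
        ssize_t n;
        void *buf;
        int fd = open("/mnt/nvmeof/testfile", O_RDONLY | O_DIRECT);

        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* O_DIRECT needs an aligned buffer; page alignment is safe. */
        if (posix_memalign(&buf, 4096, BUF_SIZE)) {
                fprintf(stderr, "posix_memalign failed\n");
                return 1;
        }

        /* The requests I saw here tracked BUF_SIZE (up to the queue
         * limits) instead of the readahead size. */
        while ((n = read(fd, buf, BUF_SIZE)) > 0)
                total += n;

        if (n < 0)
                perror("read");

        printf("read %lld bytes\n", total);
        free(buf);
        close(fd);
        return 0;
}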