Re: [PATCH v10 01/17] iova: Export alloc_iova_fast() and free_iova_fast()

Jason Wang <jasowang@xxxxxxxxxx> · Tue, 10 Aug 2021 11:02:14 +0800

在 2021/8/9 下午1:56, Yongji Xie 写道:
On Thu, Aug 5, 2021 at 9:31 PM Jason Wang <jasowang@xxxxxxxxxx> wrote:

在 2021/8/5 下午8:34, Yongji Xie 写道:
My main point, though, is that if you've already got something else
keeping track of the actual addresses, then the way you're using an
iova_domain appears to be something you could do with a trivial bitmap
allocator. That's why I don't buy the efficiency argument. The main
design points of the IOVA allocator are to manage large address spaces
while trying to maximise spatial locality to minimise the underlying
pagetable usage, and allocating with a flexible limit to support
multiple devices with different addressing capabilities in the same
address space. If none of those aspects are relevant to the use-case -
which AFAICS appears to be true here - then as a general-purpose
resource allocator it's rubbish and has an unreasonably massive memory
overhead and there are many, many better choices.

OK, I get your point. Actually we used the genpool allocator in the
early version. Maybe we can fall back to using it.

I think maybe you can share some perf numbers to see how much
alloc_iova_fast() can help.

I did some fio tests[1] with a ram-backend vduse block device[2].

Following are some performance data:

                             numjobs=1   numjobs=2    numjobs=4   numjobs=8
iova_alloc_fast    145k iops      265k iops      514k iops      758k iops

iova_alloc            137k iops     170k iops      128k iops      113k iops

gen_pool_alloc   143k iops      270k iops      458k iops      521k iops

The iova_alloc_fast() has the best performance since we always hit the
per-cpu cache. Regardless of the per-cpu cache, the genpool allocator
should be better than the iova allocator.

I think we see convincing numbers for using iova_alloc_fast() than the 
gen_poll_alloc() (45% improvement on job=8).

Thanks

[1] fio jobfile:

[global]
rw=randread
direct=1
ioengine=libaio
iodepth=16
time_based=1
runtime=60s
group_reporting
bs=4k
filename=/dev/vda
[job]
numjobs=..

[2]  $ qemu-storage-daemon \
       --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
       --monitor chardev=charmonitor \
       --blockdev
driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0
\
       --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128

The qemu-storage-daemon can be builded based on the repo:
https://github.com/bytedance/qemu/tree/vduse-test.

Thanks,
Yongji