Re: librbd 4k read/write?

"Alexander E. Patrakov" <patrakov@xxxxxxxxx> · Fri, 11 Aug 2023 22:39:02 +0800

Hello Murilo,

This is an expected result, and it is not specific to Ceph. Any
storage that consists of multiple disks will produce a performance
gain over a single disk only if the workload allows for concurrent use
of these disks - which is not the case with your 4K benchmark due to
the de-facto missing readahead. The default readahead in Linux is just
128 kilobytes, and it means that even in a linear read scenario the
benchmark has no way to hit multiple RADOS objects at once. Reminder:
they are 4 megabytes in size by default with RBD.
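
To put numbers on that, here is a rough sketch of how many RADOS
objects a single readahead window can cover. The function name
objects_per_window is made up for illustration, and the arithmetic
assumes the window starts on an object boundary and that the image
uses the default 4 MiB object size:

```sh
#!/bin/sh
# Hypothetical helper: upper bound on how many RADOS objects one
# readahead window covers when the window starts on an object boundary.
#   $1 = readahead size in KiB
#   $2 = RADOS object size in KiB (4096 = the 4 MiB RBD default)
objects_per_window() {
    ra_kb=$1
    obj_kb=$2
    # Round up: a window smaller than one object still touches one object.
    echo $(( (ra_kb + obj_kb - 1) / obj_kb ))
}

objects_per_window 128 4096     # default Linux readahead (128 KiB)
objects_per_window 32768 4096   # a 32 MiB readahead
```

With the default readahead the window stays inside one object, so
only one set of disks is busy at a time; a 32 MiB window spans
several objects that can be fetched concurrently.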

To allow for faster linear reads and writes, please create a file,
/etc/udev/rules.d/80-rbd.rules, with the following contents as a
single line (assuming that the VM sees the RBD as /dev/sda):

KERNEL=="sda", ENV{DEVTYPE}=="disk", ACTION=="add|change", ATTR{bdi/read_ahead_kb}="32768"

Or test it without any udev rule like this:

blockdev --setra 65536 /dev/sda

Both settings describe the same 32 MiB readahead. The numbers differ
because read_ahead_kb is in kilobytes, while blockdev --setra takes
512-byte sectors.
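
The unit conversion can be checked with a small sketch like the one
below. The helper name kb_to_sectors is made up for illustration; the
commented-out commands assume the device is /dev/sda, as above, and
only read the current value:

```sh
#!/bin/sh
# Hypothetical helper: convert a readahead size in KiB (the unit of
# read_ahead_kb) into 512-byte sectors (the unit of blockdev --setra).
kb_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

kb_to_sectors 32768   # the sector count matching read_ahead_kb=32768

# To inspect what a running system currently uses (assuming /dev/sda):
#   cat /sys/block/sda/queue/read_ahead_kb   # value in KiB
#   blockdev --getra /dev/sda                # value in 512-byte sectors
```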

Mandatory warning: this setting can hurt other workloads.

On Thu, Aug 10, 2023 at 11:37 PM Murilo Morais <murilo@xxxxxxxxxxxxxx> wrote:
>
> Good afternoon everybody!
>
> I have the following scenario:
> Pool RBD replication x3
> 5 hosts with 12 SAS spinning disks each
>
> I'm using exactly the following line with FIO to test:
> fio -ioengine=libaio -direct=1 -invalidate=1 -name=test -bs=4M -size=10G
> -iodepth=16 -rw=write -filename=./test.img
>
> If I increase the blocksize I can easily reach 1.5 GBps or more.
>
> But when I use blocksize in 4K I get a measly 12 Megabytes per second,
> which is quite annoying. I achieve the same rate if rw=read.
>
> If I use librbd's cache I get a considerable improvement in writing, but
> reading remains the same.
>
> I already tested with rbd_read_from_replica_policy=balance but I didn't
> notice any difference. I tried to leave readahead enabled by setting
> rbd_readahead_disable_after_bytes=0 but I didn't see any difference in
> sequential reading either.
>
> Note: I tested it on another smaller cluster, with 36 SAS disks and got the
> same result.
>
> I don't know exactly what to look for or configure to have any improvement.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

-- 
Alexander E. Patrakov