Just to be clear, did you run with "direct=1" on your fio tests? I
would also recommend disabling the librbd in-memory cache for random
IO tests against fast storage (rbd cache = false).
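For reference, a 4K random-write run against a mapped krbd device might
look roughly like this (the device path and job name are just examples):

  # direct, synchronous 4K random writes at queue depth 1
  fio --name=rbd-randwrite --filename=/dev/rbd0 --ioengine=libaio \
      --rw=randwrite --bs=4k --direct=1 --sync=1 --iodepth=1 \
      --runtime=60 --time_based

For the librbd side, the same job can be run through fio's rbd engine
(--ioengine=rbd --pool=<pool> --rbdname=<image>) with "rbd cache = false"
set on the client.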
Yes, of course, I ran it with -direct=1 -sync=1.
Thanks for the hint; I retested it without the cache and the numbers are
closer now: 1456 iops with Q=1 against a single NVMe OSD and 17470 iops
with Q=128 against the whole cluster (with the cache it's only 1350 and
9950 iops, respectively). There's definitely something wrong with this
cache...
The kernel client gives 1550 iops with Q=1 (single OSD) and 23530 iops
with Q=128 (whole cluster). I'm running this on a cluster that's currently
in use, so the load varies, but krbd still seems faster.
Semi-recent performance tests of librbd vs krbd against a single
NVMe-backed OSD showed 4K random write performance of around 75K IOPS for
krbd and 90K IOPS for librbd -- but at the expense of nearly 4x more
client-side CPU for librbd. krbd pre-allocates a lot of its memory up
front in slab allocation pools and zero-copies as much as possible,
whereas librbd/librados are hit heavily with numerous small C++ object
heap allocations and the associated initialize/copy operations.
I can't achieve such numbers even with the rbd cache disabled. It only
gives me ~9500 iops, the same for librbd and krbd, when testing against a
single NVMe OSD. Probably the CPU isn't that great (an ~8-year-old 2.2 GHz
Xeon); the benchmarked OSD eats ~650% CPU during the test. Also, I have
message signatures disabled (cephx_sign_messages = false). Without
disabling them it was even worse...
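For anyone following along, the two settings mentioned in this thread
would go into ceph.conf roughly like this (a sketch; exact section
placement may differ in your setup):

  [global]
  cephx_sign_messages = false

  [client]
  rbd cache = false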
--
With best regards,
Vitaliy Filippov