Just to be clear, did you run with "direct=1" on your fio tests? I
would also recommend disabling the librbd in-memory cache for random
IO tests against fast storage ("rbd cache = false"). Semi-recent
performance tests of librbd vs krbd against a single NVMe-backed OSD
showed 4K random write performance at around 75K IOPS for krbd and
90K IOPS for librbd -- but at the expense of nearly 4x more
client-side CPU for librbd. krbd pre-allocates a lot of its memory
up-front in slab allocation pools and zero-copies as much as
possible. librbd/librados are heavily hit with numerous C++
small-object heap allocations and the related initialize/copy
operations.

On Mon, Apr 1, 2019 at 3:06 PM Виталий Филиппов <vitalif@xxxxxxxxxx> wrote:
>
> Interesting, thanks... but how does the krbd driver handle it?
>
> Also, it doesn't seem to be a big bottleneck with small writes; at
> least I don't see ceph::buffer::copy in the valgrind and perf
> profiles...
>
> On April 1, 2019 21:25:42 GMT+03:00, Jason Dillaman <jdillama@xxxxxxxxxx> wrote:
>>
>> The C librbd/librados APIs currently need to copy the provided user
>> buffers. There is a goal to remove this unnecessary copy once the
>> underlying issue that necessitates it is addressed, but in the
>> meantime, CPU flame graphs will highlight that copy as a major
>> consumer of CPU for larger IOs [1]. There is also a lot of
>> additional memory allocation and lock contention occurring in the
>> user-space libraries that will also impact CPU and wall-clock time
>> usage.
>>
>> On Mon, Apr 1, 2019 at 2:08 PM <vitalif@xxxxxxxxxx> wrote:
>>>
>>> Hello,
>>>
>>> I've recently benchmarked random writes into the same RBD image in an
>>> all-flash cluster using `fio -ioengine=librbd` and `fio
>>> -ioengine=libaio` against the krbd-mapped /dev/rbd0.
>>>
>>> The result was that with iodepth=1 librbd gave ~0.86 ms latency and
>>> krbd gave ~0.74 ms; with iodepth=128 librbd gave ~9900 iops and krbd
>>> gave ~17000 iops. That's a huge difference; it basically means a lot
>>> of performance is wasted on the client side.
>>>
>>> It also seems the performance impact does not come from librbd
>>> itself but directly from librados, because our ceph-bench /
>>> ceph-gobench tools give almost identical write IOPS to librbd.
>>>
>>> My question is: could anyone make a guess about what is consuming
>>> so much CPU time in librados compared to the kernel RADOS client?
>>>
>>> I tried to profile it with valgrind; from the valgrind profiles it
>>> seems it's mostly ceph::buffer::list::append and friends. Could
>>> that be the culprit?
>>>
>>> --
>>> With best regards,
>>> Vitaliy Filippov
>>
>>
>> [1] https://github.com/ceph/ceph/pull/25689#issuecomment-472271162
>
>
> --
> With best regards,
> Vitaliy Filippov

--
Jason
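
For reference, the kind of test being discussed above (4K random
writes with direct IO against both the userspace and kernel clients,
with "rbd cache = false" set in the [client] section of the client's
ceph.conf) could be run roughly as follows. This is only a sketch: the
pool, image, and client names ("rbd", "bench", "admin") are
placeholders, and note that fio's userspace RBD engine is registered
as "rbd":

# userspace path: fio drives the image through librbd/librados
fio -name=librbd-randwrite -ioengine=rbd -clientname=admin \
    -pool=rbd -rbdname=bench \
    -rw=randwrite -bs=4k -iodepth=128 -runtime=60 -time_based

# kernel path: map the image and drive the block device with libaio;
# direct=1 bypasses the page cache so the two runs are comparable
rbd map rbd/bench
fio -name=krbd-randwrite -ioengine=libaio -filename=/dev/rbd0 -direct=1 \
    -rw=randwrite -bs=4k -iodepth=128 -runtime=60 -time_based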
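
To make the buffer-copy point concrete, here is a minimal sketch (not
code from this thread) of a 4K write through the C librbd API; the
pool and image names ("rbd", "bench") are again placeholders. The
caller hands librbd a plain char buffer and, per the discussion above,
the library currently copies it into an internal bufferlist before
dispatching the write, which is part of the extra client-side CPU cost
relative to krbd:

/* Minimal 4K write through the C librbd API.
 * Build: cc rbd_write.c -lrbd -lrados */
#include <rados/librados.h>
#include <rbd/librbd.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    rados_t cluster;
    rados_ioctx_t ioctx;
    rbd_image_t image;
    rbd_completion_t comp;
    char buf[4096];

    memset(buf, 0xab, sizeof(buf));

    /* connect using the default ceph.conf and client.admin keyring */
    if (rados_create(&cluster, NULL) < 0 ||
        rados_conf_read_file(cluster, NULL) < 0 ||
        rados_connect(cluster) < 0) {
        fprintf(stderr, "failed to connect to cluster\n");
        return 1;
    }
    if (rados_ioctx_create(cluster, "rbd", &ioctx) < 0 ||
        rbd_open(ioctx, "bench", &image, NULL) < 0) {
        fprintf(stderr, "failed to open image\n");
        rados_shutdown(cluster);
        return 1;
    }

    /* The caller owns 'buf'; librbd currently copies it into an
     * internal bufferlist before sending the write to the OSDs. */
    rbd_aio_create_completion(NULL, NULL, &comp);
    rbd_aio_write(image, 0, sizeof(buf), buf, comp);
    rbd_aio_wait_for_complete(comp);
    printf("write returned %zd\n", rbd_aio_get_return_value(comp));
    rbd_aio_release(comp);

    rbd_close(image);
    rados_ioctx_destroy(ioctx);
    rados_shutdown(cluster);
    return 0;
}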