Hi all,
I would like to publish messenger performance results based on my RFC
kernel patch [1], which makes it possible to consume epoll events from a
ring buffer directly in userspace, avoiding the userspace-kernel
transition on the hot path. All measurements were done using the fio
messenger engine [2], which tests the transport stack only. The
implementation of epoll polling from userspace is embedded in
msg/async/EventEpoll.cc [3] and is optional: ms_uepoll must be set to
true to enable the feature.
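For readers unfamiliar with the idea, here is a minimal sketch of how a
consumer can drain events from a shared ring without a syscall. This is
neither the ABI of the RFC patch [1] nor the code in EventEpoll.cc [3]:
the Event and Ring types, their fields and drain() are hypothetical, and
push() is an ordinary function standing in for the kernel-side producer.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical event record; the real layout is defined by the RFC patch.
struct Event {
    uint32_t ready_mask;  // e.g. readable/writable bits
    uint64_t user_data;   // opaque cookie registered with the descriptor
};

// Hypothetical single-producer/single-consumer ring. In the real feature
// the producer is the kernel and the memory is shared with userspace.
struct Ring {
    static constexpr uint32_t kSize = 1024;  // must be a power of two
    std::atomic<uint32_t> head{0};           // advanced by the consumer
    std::atomic<uint32_t> tail{0};           // advanced by the producer
    Event slots[kSize];

    // Producer side (stand-in for the kernel). Returns false when full.
    bool push(const Event& ev) {
        uint32_t t = tail.load(std::memory_order_relaxed);
        uint32_t h = head.load(std::memory_order_acquire);
        if (t - h == kSize) return false;
        slots[t & (kSize - 1)] = ev;
        // Release: the slot write must be visible before the new tail.
        tail.store(t + 1, std::memory_order_release);
        return true;
    }

    // Consumer side: copy out everything that is ready, no syscall needed.
    std::size_t drain(std::vector<Event>& out) {
        uint32_t h = head.load(std::memory_order_relaxed);
        uint32_t t = tail.load(std::memory_order_acquire);
        std::size_t n = 0;
        while (h != t) {
            out.push_back(slots[h & (kSize - 1)]);
            ++h;
            ++n;
        }
        // Release: hand the consumed slots back to the producer.
        head.store(h, std::memory_order_release);
        return n;
    }
};
```

In a real event loop the consumer would fall back to a blocking wait
only when drain() returns 0, so the syscall leaves the hot path but
remains available for sleeping.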
The fio and general configuration is as follows:
localhost, async+posix
iodepth=128
master, ms_uepoll=false
1k IOPS=110k, BW=107MiB/s, Lat=1.164ms
2k IOPS=118k, BW=231MiB/s, Lat=1.080ms
4k IOPS=111k, BW=433MiB/s, Lat=1.153ms
8k IOPS=100k, BW=784MiB/s, Lat=1.274ms
16k IOPS=94.2k, BW=1472MiB/s, Lat=1.357ms
32k IOPS=85.4k, BW=2669MiB/s, Lat=1.497ms
64k IOPS=63.3k, BW=3956MiB/s, Lat=2.021ms
128k IOPS=32.0k, BW=4001MiB/s, Lat=3.996ms
256k IOPS=17.0k, BW=4256MiB/s, Lat=7.516ms
master, ms_uepoll=true
1k IOPS=136k, BW=132MiB/s, Lat=0.944ms
2k IOPS=132k, BW=258MiB/s, Lat=0.968ms
4k IOPS=124k, BW=484MiB/s, Lat=1.032ms
8k IOPS=109k, BW=853MiB/s, Lat=1.171ms
16k IOPS=96.8k, BW=1513MiB/s, Lat=1.321ms
32k IOPS=91.0k, BW=2844MiB/s, Lat=1.405ms
64k IOPS=62.3k, BW=3894MiB/s, Lat=2.054ms
128k IOPS=32.4k, BW=4044MiB/s, Lat=3.953ms
256k IOPS=17.2k, BW=4289MiB/s, Lat=7.459ms
So avoiding the transition from userspace to kernel gives a significant
IOPS gain for block sizes up to 32k:
1k +23%
2k +11%
4k +11%
8k +9%
16k +2%
32k +7%
64k -1%
128k +1%
256k +1%
The non-optimal messenger loop implementation on the write path and the
memory copies of big blocks wipe out the gain starting from 64k block
sizes. The following PR [4] makes the write queue more efficient,
especially when ms_uepoll is enabled:
messenger-wqueue, ms_uepoll=true
1k IOPS=196k, BW=191MiB/s, Lat=0.653ms
2k IOPS=187k, BW=364MiB/s, Lat=0.685ms
4k IOPS=165k, BW=644MiB/s, Lat=0.776ms
8k IOPS=135k, BW=1051MiB/s, Lat=0.951ms
16k IOPS=123k, BW=1927MiB/s, Lat=1.035ms
32k IOPS=101k, BW=3148MiB/s, Lat=1.270ms
64k IOPS=74.3k, BW=4646MiB/s, Lat=1.721ms
128k IOPS=33.6k, BW=4196MiB/s, Lat=3.811ms
256k IOPS=17.3k, BW=4312MiB/s, Lat=7.419ms
So compared to the original master (ms_uepoll=false), the IOPS increase is:
1k +78%
2k +58%
4k +48%
8k +35%
16k +30%
32k +18%
64k +17%
128k +5%
256k +1%
Currently the uepoll descriptor is created with fixed ring buffer space
for only 1024 descriptors. This limitation can easily be overcome on the
EventEpoll.cc side, but this is not done yet.
[1] https://lwn.net/Articles/777263/
[2] https://github.com/ceph/ceph/pull/24678
[3] https://github.com/rouming/ceph/commits/messenger-uepoll
[4] https://github.com/ceph/ceph/pull/26932
--
Roman