[RFC] messenger: epoll events consumed from userspace, perf tweaks #1

Roman Penyaev <rpenyaev@xxxxxxx> · Wed, 27 Mar 2019 16:45:51 +0100

Hi all,

I would like to share messenger performance results based on my RFC kernel
patch [1], which makes it possible to consume epoll events from a ring
buffer directly in userspace, avoiding the userspace-kernel transition on
the hot path.  All measurements were done using the fio messenger engine
[2], which tests the transport stack only.  The implementation of an epoll
that is pollable from userspace is embedded in msg/async/EventEpoll.cc [3]
and is optional: ms_uepoll must be set to true to enable the feature.
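
To illustrate why reading events from a ring buffer avoids a syscall, here
is a minimal sketch of a single-producer/single-consumer ring.  It is NOT
the kernel ABI from [1] and not the code in EventEpoll.cc: the names Event
and EventRing and the record layout are hypothetical, and both sides live
in one process here, whereas in the real design the kernel would be the
producer and the memory would be shared with userspace.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical event record; NOT the layout used by the kernel RFC patch.
struct Event {
  std::uint32_t ready_mask;  // readiness bits (readable, writable, ...)
  std::uint64_t user_data;   // opaque cookie identifying the descriptor
};

// Single-producer/single-consumer ring, for illustration only.
class EventRing {
public:
  explicit EventRing(std::size_t capacity)
      : slots_(capacity), mask_(capacity - 1) {
    // A power-of-two capacity lets indices wrap with a cheap mask.
    assert(capacity != 0 && (capacity & (capacity - 1)) == 0);
  }

  // Producer side: returns false when the ring is full.
  bool push(const Event& ev) {
    const std::uint32_t tail = tail_.load(std::memory_order_relaxed);
    const std::uint32_t head = head_.load(std::memory_order_acquire);
    if (tail - head == slots_.size()) return false;
    slots_[tail & mask_] = ev;
    tail_.store(tail + 1, std::memory_order_release);  // publish the slot
    return true;
  }

  // Consumer side: only atomic loads and one store, no syscall.
  // Returns false when the ring is empty; a caller could then fall
  // back to a blocking wait.
  bool pop(Event& out) {
    const std::uint32_t head = head_.load(std::memory_order_relaxed);
    const std::uint32_t tail = tail_.load(std::memory_order_acquire);
    if (head == tail) return false;
    out = slots_[head & mask_];
    head_.store(head + 1, std::memory_order_release);  // free the slot
    return true;
  }

private:
  std::vector<Event> slots_;
  std::size_t mask_;
  std::atomic<std::uint32_t> head_{0};
  std::atomic<std::uint32_t> tail_{0};
};
```

As long as pop() keeps returning events, the consumer never enters the
kernel, which is the effect the measurements below try to quantify.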

The fio and generic configuration is the following:

    localhost, async+posix
    iodepth=128

master, ms_uepoll=false

   1k  IOPS=110k,  BW=107MiB/s,  Lat=1.164ms
   2k  IOPS=118k,  BW=231MiB/s,  Lat=1.080ms
   4k  IOPS=111k,  BW=433MiB/s,  Lat=1.153ms
   8k  IOPS=100k,  BW=784MiB/s,  Lat=1.274ms
  16k  IOPS=94.2k, BW=1472MiB/s, Lat=1.357ms
  32k  IOPS=85.4k, BW=2669MiB/s, Lat=1.497ms
  64k  IOPS=63.3k, BW=3956MiB/s, Lat=2.021ms
 128k  IOPS=32.0k, BW=4001MiB/s, Lat=3.996ms
 256k  IOPS=17.0k, BW=4256MiB/s, Lat=7.516ms

master, ms_uepoll=true

   1k  IOPS=136k,  BW=132MiB/s,  Lat=0.944ms
   2k  IOPS=132k,  BW=258MiB/s,  Lat=0.968ms
   4k  IOPS=124k,  BW=484MiB/s,  Lat=1.032ms
   8k  IOPS=109k,  BW=853MiB/s,  Lat=1.171ms
  16k  IOPS=96.8k, BW=1513MiB/s, Lat=1.321ms
  32k  IOPS=91.0k, BW=2844MiB/s, Lat=1.405ms
  64k  IOPS=62.3k, BW=3894MiB/s, Lat=2.054ms
 128k  IOPS=32.4k, BW=4044MiB/s, Lat=3.953ms
 256k  IOPS=17.2k, BW=4289MiB/s, Lat=7.459ms

So avoiding the transition from userspace to kernel gives an IOPS gain of
2% to 23% for block sizes up to 32k:

   1k  +23%
   2k  +11%
   4k  +11%
   8k  +9%
  16k  +2%
  32k  +7%
  64k  -1%
 128k  +1%
 256k  +1%
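
For reference, the relative gain can be recomputed from the IOPS columns
with a small helper like the one below.  The function name gain_percent is
made up for this sketch.  The tables show rounded IOPS, so recomputed
values may differ from the listed percentages by about a point.

```cpp
#include <cassert>

// Relative change of `after` versus `before`, in percent.
// Positive means `after` is faster, negative means it is slower.
double gain_percent(double before, double after) {
  assert(before > 0.0);
  return (after - before) / before * 100.0;
}
```

For example, gain_percent(110.0, 136.0) for the 1k row is about 23.6, and
gain_percent(63.3, 62.3) for the 64k row is about -1.6.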

The non-optimal messenger loop implementation on the write path and the
memory copies of big blocks wipe out the gain starting from 64k block
sizes.  The following PR [4] makes the write queue more efficient,
especially when ms_uepoll is enabled:

messenger-wqueue, ms_uepoll=true

   1k  IOPS=196k,  BW=191MiB/s,  Lat=0.653ms
   2k  IOPS=187k,  BW=364MiB/s,  Lat=0.685ms
   4k  IOPS=165k,  BW=644MiB/s,  Lat=0.776ms
   8k  IOPS=135k,  BW=1051MiB/s, Lat=0.951ms
  16k  IOPS=123k,  BW=1927MiB/s, Lat=1.035ms
  32k  IOPS=101k,  BW=3148MiB/s, Lat=1.270ms
  64k  IOPS=74.3k, BW=4646MiB/s, Lat=1.721ms
 128k  IOPS=33.6k, BW=4196MiB/s, Lat=3.811ms
 256k  IOPS=17.3k, BW=4312MiB/s, Lat=7.419ms

So compared to the original master (ms_uepoll=false), the IOPS increase in
percent is:

   1k  +78%
   2k  +58%
   4k  +48%
   8k  +35%
  16k  +30%
  32k  +18%
  64k  +17%
 128k  +5%
 256k  +1%

Currently the uepoll descriptor is created with a fixed ring buffer space
for 1024 descriptors only.  This limitation can easily be overcome on the
EventEpoll.cc side, but that is not done yet.

[1] https://lwn.net/Articles/777263/
[2] https://github.com/ceph/ceph/pull/24678
[3] https://github.com/rouming/ceph/commits/messenger-uepoll
[4] https://github.com/ceph/ceph/pull/26932

--
Roman