Hi Bart, Sagi and all,

With this email I would like to share some fresh RDMA performance results
for IBNBD, SCST and NVMEoF, based on the 4.10 kernel and a variety of
configurations.

All fio runs are grouped by project name, crucial config differences
(e.g. CPU pinning or register_always=N) and two testing modes: MANY-DISKS
and MANY-JOBS. In each group of results the number of simultaneous fio
jobs increases from 1 up to 128. In MANY-DISKS mode one fio job is
dedicated to one disk and the number of jobs (and disks) grows; in
MANY-JOBS mode, in its turn, each fio job produces IO for the same disk,
i.e.:

MANY-DISKS:

  x1:   numjobs=1
        [job1]   filename=/dev/nvme0n1
  ...
  x128: numjobs=1
        [job1]   filename=/dev/nvme0n1
        [job2]   filename=/dev/nvme0n2
        ...
        [job128] filename=/dev/nvme0n128

MANY-JOBS:

  x1:   numjobs=1
        [job1] filename=/dev/nvme0n1
  ...
  x128: numjobs=128
        [job1] filename=/dev/nvme0n1

Each group of results constitutes a performance measurement that can
easily be plotted, taking the number of jobs as the X axis and IOPS,
overall IO latency or anything else extracted from the fio json result
files as the Y axis.

The fio configurations were generated, and saved along with the produced
fio json results, by the fio-runner.py script [1]. The complete archive
with the fio configs and results can be downloaded here [2].

The following metrics were taken from the fio json results:

  write/iops     - IOPS
  write/lat/mean - average latency (μs)
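For the curious: the metric names above map directly onto paths in the
fio json output, so extracting them takes only a few lines of Python.
Here is a minimal sketch (for illustration only, assuming the fio 2.x
json layout; the actual extraction is done by the fio-runner.py script
[1]):

  import json
  import sys

  # Print the two metrics from a single fio json result file.
  # Assumes latencies under 'lat' are reported in microseconds,
  # as in fio 2.x (newer fio versions report 'lat_ns' instead).
  with open(sys.argv[1]) as f:
      result = json.load(f)

  # With group_reporting fio aggregates all jobs into a single entry.
  job = result["jobs"][0]

  print("write/iops:     %.2f" % job["write"]["iops"])
  print("write/lat/mean: %.2f" % job["write"]["lat"]["mean"])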
Here I would like to present a reduced results table, covering only the
runs with CPU pinning in MANY-DISKS mode, since CPU pinning makes more
sense in terms of performance and the MANY-DISKS and MANY-JOBS results
look very similar:

write/iops (MANY-DISKS):

         IBNBD_pin  NVME_noreg_pin   NVME_pin  SCST_noreg_pin   SCST_pin
x1        80398.96        75577.24   54110.19        59555.04   48446.05
x2       109018.60        96478.45   69176.77        73925.81   55557.59
x4       169164.56       140558.75   93700.96        75419.91   56294.61
x8       197725.44       159887.33   99773.05        79750.92   55938.84
x16      176782.36       150448.33   99644.05        92964.23   56463.14
x32      139666.00       123198.38   81845.30        81287.98   50590.86
x64      125666.16        82231.77   72117.67        72023.32   45121.17
x128     120253.63        73911.97   65665.08        74642.27   47268.46

write/lat/mean (MANY-DISKS), μs:

         IBNBD_pin  NVME_noreg_pin   NVME_pin  SCST_noreg_pin   SCST_pin
x1          647.78          697.91    1032.97          925.51    1173.04
x2          973.20         1104.38    1612.75         1462.18    2047.11
x4         1279.49         1528.09    2452.22         3188.41    4235.95
x8         2356.92         2929.87    4891.70         6248.85    8907.10
x16        5605.62         6575.70   10046.40        10830.50   17945.57
x32       14489.54        16516.60   24849.16        24984.26   40335.09
x64       32364.39        49481.42   56615.23        56559.02   90590.84
x128      67570.88       110768.70  124249.40       109321.84  171390.00

* The suffixes mean:
  _pin   - CPU pinning
  _noreg - modules on the initiator side (ib_srp, nvme_rdma) were loaded
           with the 'register_always=N' parameter

The complete table results and the corresponding graphs are presented in
a Google sheet [3].

Conclusion: on average IBNBD outperforms by:

            NVME_noreg_pin   NVME_pin  SCST_noreg_pin   SCST_pin
  iops                 41%        72%             61%       155%
  lat/mean             28%        42%             38%        60%

* The complete table results [3] were taken into account for the average
  percentage calculation.

The test setup is the following:

Initiator and target HW configuration:
  AMD Opteron 6386 SE, 64 CPUs, 128GB
  InfiniBand: Mellanox Technologies MT26428
              [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE]

Initiator and target SW configuration:
  vanilla Linux 4.10 + IBNBD patches + SCST from
  https://github.com/bvanassche/scst, master branch

Initiator side:
  IBNBD and NVME: MQ mode.
  SRP: default RQ; on an attempt to set 'use_blk_mq=Y' IO hangs.

FIO generic configuration pattern:

  bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4
  fadvise_hint=0
  rw=randrw:2
  direct=1
  random_distribution=zipf:1.2
  time_based=1
  runtime=10
  ioengine=libaio
  iodepth=128
  iodepth_batch_submit=128
  iodepth_batch_complete=128
  group_reporting

Target side:
  128 null_blk devices with the default configuration, opened as blockio.
  NVMEoF configuration script [4].
  SCST configuration script [5].

It would be great to receive any feedback. I am open to further
performance tuning and testing with other possible configurations and
options.

Thanks.

--
Roman

[1] FIO runner and results extractor script:
    https://drive.google.com/open?id=0B8_SivzwHdgSS2RKcmc4bWg0YjA
[2] Archive with FIO configurations and results:
    https://drive.google.com/open?id=0B8_SivzwHdgSaDlhMXV6THhoRXc
[3] Google sheet with performance measurements:
    https://drive.google.com/open?id=1sCTBKLA5gbhhkgd2USZXY43VL3zLidzdqDeObZn9Edc
[4] NVMEoF configuration:
    https://drive.google.com/open?id=0B8_SivzwHdgSTzRjbGtmaVR6LWM
[5] SCST configuration:
    https://drive.google.com/open?id=0B8_SivzwHdgSM1B5eGpKWmFJMFk