Re: 4x lower IOPS: Linux MD vs indiv. devices - why?

On 23.01.2017 at 21:24, Sitsofe Wheeler wrote:
> On 23 January 2017 at 19:40, Tobias Oberstein
> <tobias.oberstein@xxxxxxxxx> wrote:
>> On 23.01.2017 at 20:13, Sitsofe Wheeler wrote:

>>> On 23 January 2017 at 18:33, Tobias Oberstein
>>> <tobias.oberstein@xxxxxxxxx> wrote:

>>>> libaio is nowhere near what I get with engine=sync and high job counts.
>>>> Mmh.
>>>> Plus the strange behavior.

>>> Have you tried batching the IOs and controlling how many you are
>>> reaping at any one time? See
>>> http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-iodepth_batch_submit
>>> for some of the options for controlling this...

>> Thanks! Nice.
>>
>> For libaio, and with all the hints applied (no 4k sectors yet), I get (4k
>> randread):
>>
>> Individual NVMes: iops=7350.4K
>> MD (RAID-0) over NVMes: iops=4112.8K
>>
>> The going up and down of IOPS is gone.
>>
>> It's becoming more apparent, I'd say, that there is an MD bottleneck though.

If you're "just" trying for higher IOPS you can also try gtod_reduce
(see http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-gtod_reduce
). This subsumes things like disable_lat but you'll get fewer and less
accurate measurement stats back. With libaio userspace reap
(http://fio.readthedocs.io/en/latest/fio_doc.html#cmdoption-arg-userspace_reap
) can sometimes nudge numbers up but at the cost of CPU.
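
For reference, a minimal sketch of where those knobs would sit in a job file's
[global] section (values are illustrative, not exactly what I use further below;
userspace_reap applies to ioengine=libaio only):

[global]
ioengine=libaio
iodepth=64
# submit up to 16 IOs per io_submit() call instead of one at a time
iodepth_batch_submit=16
# drop most gettimeofday()-based accounting to cut measurement overhead
gtod_reduce=1
# reap completions directly from user space (libaio only), trading CPU for IOPS
userspace_reap=1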


Using that option plus bumping to QD=64 and batch submit 16, I get

plain NVMes:   iops=7415.9K
MD over NVMes: iops=4112.4K

These are staggering numbers for sure!

In fact, the Intel P3608 4TB datasheet says: up to 850k random 4kB read IOPS.

Since we have 8 (physical) of these, the real-world measurement (7.4 million IOPS) is even above the datasheet figure (8 x 850k = 6.8 million).

I'd say: very good job Intel =)

The price, of course, is the CPU load it takes to reach these numbers .. we have the 2nd-largest Intel Xeon available

Intel(R) Xeon(R) CPU E7-8880 v4 @ 2.20GHz

and 4 of these .. and even that isn't enough to saturate these NVMe beasts while still having room to do useful work (PostgreSQL).

So we're gonna be CPU bound .. again - this is the 2nd iteration of such a box. The first one has 48 E7 v2 cores and 8 x P3700 2TB, and it is also CPU bound on PostgreSQL anyway .. with 3TB RAM.

Cheers,
/Tobias




randread-individual-nvmes: (groupid=0, jobs=128): err= 0: pid=37454: Mon Jan 23 22:12:30 2017
  read : io=869361MB, bw=28968MB/s, iops=7415.9K, runt= 30011msec
  cpu          : usr=6.14%, sys=64.55%, ctx=59170293, majf=0, minf=8320

randread-md-over-nvmes: (groupid=1, jobs=128): err= 0: pid=37582: Mon Jan 23 22:12:30 2017
  read : io=481982MB, bw=16064MB/s, iops=4112.4K, runt= 30004msec
  cpu          : usr=3.88%, sys=95.88%, ctx=14209, majf=0, minf=6784
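
The sys time difference (64.55% vs 95.88%) points at the kernel side of MD. One way to see where that kernel time goes (just a suggestion, assuming perf is installed on the box) would be to profile while the MD job is running:

# sample CPU usage with call graphs while the randread-md-over-nvmes job runs
perf top -g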



[global]
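# 4k random reads via libaio at QD 64 with batched submits of 16,
# direct I/O, 30 s time-based runs, reduced time-keeping overhead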
group_reporting
size=30G
ioengine=libaio
iodepth=64
iodepth_batch_submit=16
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
disable_lat=1
gtod_reduce=1
bs=4k
runtime=30

[randread-individual-nvmes]
stonewall
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
rw=randread
numjobs=128

[randread-md-over-nvmes]
stonewall
filename=/dev/md1
rw=randread
numjobs=128



