Hi Tobias,

MDRAID overhead is always there, but you can play with some tuning knobs. I recommend the following:

1. You need to use many threads/jobs with a fairly high queue depth (QD). The highest IOPS on Intel P3xxx drives is achieved when you saturate them with 128 outstanding 4k IOs per drive. This can be done with 32 jobs at QD=4, or 16 jobs at QD=8, and so on. With MDRAID on top of that, you should multiply by the number of drives in the array. So I think the problem right now is simply that you are not submitting enough IOs. (See the fio sketch appended at the end of this mail.)

2. Changing the SSD's hardware sector size to 4k may also help, if you are sure that your workload is always 4k-granular. (An nvme-cli sketch is appended below as well.)

3. And finally, use the "imsm" MDRAID extensions and the latest mdadm build. (An mdadm sketch is appended below.)

See some other hints here:
http://www.slidesearchengine.com/slide/hands-on-lab-how-to-unleash-your-storage-performance-by-using-nvm-express-based-pci-express-solid-state-drives

Some config examples for NVMe are here:
https://github.com/01org/fiovisualizer/tree/master/Workloads

--
Andrey Kudryavtsev,
SSD Solution Architect
Intel Corp.

inet: 83564353
work: +1-916-356-4353
mobile: +1-916-221-2281

On 1/23/17, 8:26 AM, "fio-owner@xxxxxxxxxxxxxxx on behalf of Tobias Oberstein" <fio-owner@xxxxxxxxxxxxxxx on behalf of tobias.oberstein@xxxxxxxxx> wrote:

Hi,

I have a question regarding Linux software RAID (MD) as tested with fio - so this is slightly OT, but I am hoping for expert advice or redirection to a more appropriate place (if this is unwelcome here).

I have a box with this HW:

- 88 cores Xeon E7 (176 HTs) + 3TB RAM
- 8 x Intel P3608 4TB NVMe (which is logically 16 NVMe devices)

With a random 4kB read load, I am able to max it out at 7 million IOPS - but only if I run fio on the _individual_ NVMe devices.

[global]
group_reporting
filename=/dev/nvme0n1:/dev/nvme1n1:/dev/nvme2n1:/dev/nvme3n1:/dev/nvme4n1:/dev/nvme5n1:/dev/nvme6n1:/dev/nvme7n1:/dev/nvme8n1:/dev/nvme9n1:/dev/nvme10n1:/dev/nvme11n1:/dev/nvme12n1:/dev/nvme13n1:/dev/nvme14n1:/dev/nvme15n1
size=30G
ioengine=sync
iodepth=1
thread=1
direct=1
time_based=1
randrepeat=0
norandommap=1
bs=4k
runtime=120

[randread]
stonewall
rw=randread
numjobs=2560

When I create a stripe set over all devices:

sudo mdadm --create /dev/md1 --chunk=8 --level=0 --raid-devices=16 \
  /dev/nvme0n1 \
  /dev/nvme1n1 \
  /dev/nvme2n1 \
  /dev/nvme3n1 \
  /dev/nvme4n1 \
  /dev/nvme5n1 \
  /dev/nvme6n1 \
  /dev/nvme7n1 \
  /dev/nvme8n1 \
  /dev/nvme9n1 \
  /dev/nvme10n1 \
  /dev/nvme11n1 \
  /dev/nvme12n1 \
  /dev/nvme13n1 \
  /dev/nvme14n1 \
  /dev/nvme15n1

I only get 1.6 million IOPS. Detailed results are down below.

Note: the array is created with a chunk size of 8K because this is for a database workload. Here I tested with a 4k block size, but it's similar (lower performance on MD) with 8k.

Any help or hints would be greatly appreciated!

Cheers,
/Tobias

7 million IOPS on raw, individual NVMe devices
==============================================

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo /opt/fio/bin/fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2560 threads
Jobs: 2367 (f=29896): [_(2),f(3),_(2),f(11),_(2),f(2),_(9),f(1),_(1),f(1),_(3),f(1),_(1),f(1),_(13),f(1),_(8),f(1),_(1),f(4),_(2),f(1),_(1),f(1),_(3),f(2),_(3),f(3),_(8),f(2),_(1),f(3),_(3),f(60),_(1),f(20),_(1),f(33),_(1),f(14),_(1),f(18),_(4),f(6),_(1),f(6),_(1),f(1),_(1),f(1),_(1),f(4),_(1),f(2),_(1),f(11),_(1),f(11),_(4),f(74),_(1),f(8),_(1),f(11),_(1),f(8),_(1),f(61),_(1),f(38),_(1),f(31),_(1),f(5),_(1),f(103),_(1),f(24),E(1),f(27),_(1),f(28),_(1),f(1),_(1),f(134),_(1),f(62),_(1),f(48),_(1),f(27),_(1),f(59),_(1),f(30),_(1),f(14),_(1),f(25),_(1),f(2),_(1),f(25),_(1),f(31),_(1),f(9),_(1),f(7),_(1),f(8),_(1),f(13),_(1),f(28),_(1),f(7),_(1),f(84),_(1),f(42),_(1),f(5),_(1),f(8),_(1),f(20),_(1),f(15),_(1),f(19),_(1),f(3),_(1),f(19),_(1),f(7),_(1),f(17),_(1),f(34),_(1),f(1),_(1),f(4),_(1),f(1),_(1),f(1),_(2),f(3),_(1),f(1),_(1),f(1),_(1),f(8),_(1),f(6),_(1),f(3),_(1),f(3),_(1),f(53),_(1),f(7),_(1),f(19),_(1),f(6),_(1),f(5),_(1),f(22),_(1),f(11),_(1),f(12),_(1),f(3),_(1),f(16),_(1),f(149),_(1),f(20),_(1),f(27),_(1),f(7),_(1),f(29),_(1),f(2),_(1),f(11),_(1),f(46),_(1),f(8),_(2),f(1),_(1),f(1),_(1),f(14),E(1),f(4),_(1),f(22),_(1),f(11),_(1),f(70),_(2),f(11),_(1),f(2),_(1),f(1),_(1),f(1),_(1),f(21),_(1),f(8),_(1),f(4),_(1),f(45),_(2),f(1),_(1),f(18),_(1),f(12),_(1),f(6),_(1),f(5),_(1),f(27),_(1),f(3),_(1),f(3),_(1),f(19),_(1),f(4),_(1),f(25),_(1),f(4),_(1),f(1),_(1),f(2),_(1),f(1),_(1),f(13),_(1),f(18),_(1),f(1),_(1),f(1),_(1),f(29),_(1),f(27)][100.0%][r=21.1GiB/s,w=0KiB/s][r=5751k,w=0 IOPS][eta 00m:00s]
randread: (groupid=0, jobs=2560): err= 0: pid=114435: Mon Jan 23 15:47:17 2017
   read: IOPS=6965k, BW=26.6GiB/s (28.6GB/s)(3189GiB/120007msec)
    clat (usec): min=38, max=33262, avg=360.11, stdev=465.36
     lat (usec): min=38, max=33262, avg=360.20, stdev=465.40
    clat percentiles (usec):
     | 1.00th=[ 114], 5.00th=[ 135], 10.00th=[ 149], 20.00th=[ 171],
     | 30.00th=[ 191], 40.00th=[ 213], 50.00th=[ 239], 60.00th=[ 270],
     | 70.00th=[ 314], 80.00th=[ 378], 90.00th=[ 556], 95.00th=[ 980],
     | 99.00th=[ 2704], 99.50th=[ 3312], 99.90th=[ 4576], 99.95th=[ 5216],
     | 99.99th=[ 8096]
   lat (usec) : 50=0.01%, 100=0.11%, 250=53.75%, 500=34.23%, 750=5.23%
   lat (usec) : 1000=1.79%
   lat (msec) : 2=2.88%, 4=1.81%, 10=0.20%, 20=0.01%, 50=0.01%
  cpu          : usr=0.63%, sys=4.89%, ctx=837434400, majf=0, minf=2557
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=835852266,0,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=26.6GiB/s (28.6GB/s), 26.6GiB/s-26.6GiB/s (28.6GB/s-28.6GB/s), io=3189GiB (3424GB), run=120007-120007msec

Disk stats (read/write):
  nvme0n1: ios=52191377/0, merge=0/0, ticks=14400568/0, in_queue=14802400, util=100.00%
  nvme1n1: ios=52241684/0, merge=0/0, ticks=13919744/0, in_queue=15101276, util=100.00%
  nvme2n1: ios=52241537/0, merge=0/0, ticks=11146952/0, in_queue=12053112, util=100.00%
  nvme3n1: ios=52241416/0, merge=0/0, ticks=10806624/0, in_queue=11135004, util=100.00%
  nvme4n1: ios=52241285/0, merge=0/0, ticks=19320448/0, in_queue=21079576, util=100.00%
  nvme5n1: ios=52241142/0, merge=0/0, ticks=18786968/0, in_queue=19393024, util=100.00%
  nvme6n1: ios=52241000/0, merge=0/0, ticks=19610892/0, in_queue=20140104, util=100.00%
  nvme7n1: ios=52240874/0, merge=0/0, ticks=20482920/0, in_queue=21090048, util=100.00%
  nvme8n1: ios=52240731/0, merge=0/0, ticks=14533992/0, in_queue=14929172, util=100.00%
  nvme9n1: ios=52240587/0, merge=0/0, ticks=12854956/0, in_queue=13919288, util=100.00%
  nvme10n1: ios=52240447/0, merge=0/0, ticks=11085508/0, in_queue=11390392, util=100.00%
  nvme11n1: ios=52240301/0, merge=0/0, ticks=18490260/0, in_queue=20110288, util=100.00%
  nvme12n1: ios=52240097/0, merge=0/0, ticks=11377884/0, in_queue=11683568, util=100.00%
  nvme13n1: ios=52239956/0, merge=0/0, ticks=15205304/0, in_queue=16314628, util=100.00%
  nvme14n1: ios=52239766/0, merge=0/0, ticks=27003788/0, in_queue=27659920, util=100.00%
  nvme15n1: ios=52239620/0, merge=0/0, ticks=17352624/0, in_queue=17910636, util=100.00%

1.6 million IOPS on Linux MD over 16 NVMe devices
=================================================

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$ sudo /opt/fio/bin/fio postgresql_storage_workload.fio
randread: (g=0): rw=randread, bs=4096B-4096B,4096B-4096B,4096B-4096B, ioengine=sync, iodepth=1
...
fio-2.17-17-g9cf1
Starting 2560 threads
Jobs: 2560 (f=2560): [r(2560)][100.0%][r=6212MiB/s,w=0KiB/s][r=1590k,w=0 IOPS][eta 00m:00s]
randread: (groupid=0, jobs=2560): err= 0: pid=146070: Mon Jan 23 17:21:15 2017
   read: IOPS=1588k, BW=6204MiB/s (6505MB/s)(728GiB/120098msec)
    clat (usec): min=27, max=28498, avg=124.51, stdev=113.10
     lat (usec): min=27, max=28498, avg=124.58, stdev=113.10
    clat percentiles (usec):
     | 1.00th=[ 78], 5.00th=[ 84], 10.00th=[ 86], 20.00th=[ 89],
     | 30.00th=[ 95], 40.00th=[ 102], 50.00th=[ 105], 60.00th=[ 108],
     | 70.00th=[ 118], 80.00th=[ 133], 90.00th=[ 173], 95.00th=[ 221],
     | 99.00th=[ 358], 99.50th=[ 506], 99.90th=[ 2192], 99.95th=[ 2608],
     | 99.99th=[ 2960]
   lat (usec) : 50=0.06%, 100=35.14%, 250=61.83%, 500=2.46%, 750=0.19%
   lat (usec) : 1000=0.07%
   lat (msec) : 2=0.13%, 4=0.12%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=0.08%, sys=4.49%, ctx=200431993, majf=0, minf=2557
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwt: total=190730463,0,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=6204MiB/s (6505MB/s), 6204MiB/s-6204MiB/s (6505MB/s-6505MB/s), io=728GiB (781GB), run=120098-120098msec

Disk stats (read/write):
    md1: ios=190632612/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%, aggrios=11920653/0, aggrmerge=0/0, aggrticks=1228287/0, aggrin_queue=1247601, aggrutil=100.00%
  nvme15n1: ios=11919850/0, merge=0/0, ticks=1214924/0, in_queue=1225896, util=100.00%
  nvme6n1: ios=11921162/0, merge=0/0, ticks=1182716/0, in_queue=1191452, util=100.00%
  nvme9n1: ios=11916313/0, merge=0/0, ticks=1265060/0, in_queue=1296728, util=100.00%
  nvme11n1: ios=11922174/0, merge=0/0, ticks=1206084/0, in_queue=1239808, util=100.00%
  nvme2n1: ios=11921547/0, merge=0/0, ticks=1238956/0, in_queue=1272916, util=100.00%
  nvme14n1: ios=11923176/0, merge=0/0, ticks=1168688/0, in_queue=1178360, util=100.00%
  nvme5n1: ios=11923142/0, merge=0/0, ticks=1192656/0, in_queue=1207808, util=100.00%
  nvme8n1: ios=11921507/0, merge=0/0, ticks=1250164/0, in_queue=1258956, util=100.00%
  nvme10n1: ios=11919058/0, merge=0/0, ticks=1294028/0, in_queue=1304536, util=100.00%
  nvme1n1: ios=11923129/0, merge=0/0, ticks=1246892/0, in_queue=1281952, util=100.00%
  nvme13n1: ios=11923354/0, merge=0/0, ticks=1241540/0, in_queue=1271820, util=100.00%
  nvme4n1: ios=11926936/0, merge=0/0, ticks=1190384/0, in_queue=1224192, util=100.00%
  nvme7n1: ios=11921139/0, merge=0/0, ticks=1200624/0, in_queue=1214240, util=100.00%
  nvme0n1: ios=11916614/0, merge=0/0, ticks=1230916/0, in_queue=1242372, util=100.00%
  nvme12n1: ios=11916963/0, merge=0/0, ticks=1266840/0, in_queue=1277600, util=100.00%
  nvme3n1: ios=11914399/0, merge=0/0, ticks=1262128/0, in_queue=1272988, util=100.00%

oberstet@svr-psql19:~/scm/parcit/RA/adr/system/docs$
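
Example sketches for the tuning points above. These are rough and untested; device names, job counts and parameters are assumptions to adapt, not values taken from the measurements above.

For point 1, a minimal fio job sketch that drives the MD device with asynchronous IO at a high aggregate queue depth. The libaio engine, the 64 jobs x QD=32 split (64 * 32 = 2048 outstanding IOs, i.e. roughly 128 per drive across 16 drives) and the runtime are placeholders to tune:

[global]
; async engine so that iodepth > 1 actually keeps IOs in flight
ioengine=libaio
direct=1
bs=4k
rw=randread
norandommap=1
randrepeat=0
time_based=1
runtime=120
group_reporting

[md-randread]
filename=/dev/md1
; 64 jobs * QD 32 = 2048 outstanding IOs, ~128 per drive over 16 drives
numjobs=64
iodepth=32

The same aggregate depth can be reached with other splits (e.g. 128 jobs at QD=16); what matters is the total number of outstanding IOs per drive.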
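
For point 2, a rough nvme-cli sketch for switching a namespace to a 4k LBA format. Formatting destroys all data on the namespace, and the --lbaf index below is only a placeholder; read the list reported by id-ns first and pick the entry whose lbads:12 corresponds to 4096-byte sectors:

# List the LBA formats the namespace supports (look for "lbads:12" = 4k)
sudo nvme id-ns /dev/nvme0n1 | grep lbaf

# Reformat to the 4k LBA format -- index 3 is a placeholder, use the index
# reported by the command above. THIS DESTROYS ALL DATA on the namespace.
sudo nvme format /dev/nvme0n1 --lbaf=3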
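
For point 3, a hedged mdadm sketch of an IMSM-based RAID0: first an IMSM container, then a RAID0 volume inside it. This assumes the platform actually supports IMSM metadata on these NVMe devices (Intel RSTe/VROC); on unsupported platforms mdadm will refuse unless IMSM_NO_PLATFORM=1 is set, and IMSM may restrict the allowed chunk sizes, so the --chunk=8 kept from the original array may need adjusting:

# Create an IMSM container over all 16 namespaces (bash brace expansion)
sudo mdadm --create /dev/md/imsm0 --metadata=imsm --raid-devices=16 \
    /dev/nvme{0..15}n1

# Create the RAID0 volume inside the container
sudo mdadm --create /dev/md/vol0 --level=0 --chunk=8 --raid-devices=16 \
    /dev/md/imsm0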