I think I have run into a bottleneck somewhere in/around RAID and the filesystem, and I am trying to determine whether it's a known issue or something I am doing wrong. Briefly: I have 4x NVME SSDs. When formatted as 4 separate ext4 filesystems, a sequential write benchmark does 8 GB/s in aggregate, which is about what I expect. But when the drives are striped together using mdadm raid0, sequential write speed drops below 5 GB/s. In both cases I am using fio with 8 writer threads writing to 8 separate files, libaio, iodepth 32, large buffers, no atime, and other sensible options (see pastebin at bottom for details).

When using the drives separately, top looks like this:

top - 10:49:52 up 38 min,  4 users,  load average: 5.49, 2.42, 2.45
Tasks: 363 total,  15 running, 221 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.0 us, 54.6 sy,  0.0 ni, 42.5 id,  0.8 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 23252352 total,   428864 free,  1056968 used, 21766520 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 21538676 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3204 root      20   0  745432  37492   1064 R  99.9  0.2   0:35.64 fio
 3208 root      20   0  745456  37508   1076 R  99.8  0.2   0:35.62 fio
 3209 root      20   0  745464  37492   1056 R  99.7  0.2   0:35.65 fio
 3206 root      20   0  745444  37496   1060 R  99.6  0.2   0:35.54 fio
 3210 root      20   0  745468  37504   1076 R  99.6  0.2   0:35.64 fio
 3207 root      20   0  745452  37500   1068 R  99.4  0.2   0:35.54 fio
 3203 root      20   0  745428  37500   1068 R  99.2  0.2   0:35.59 fio
 3205 root      20   0  745440  37500   1068 R  99.0  0.2   0:35.43 fio
  158 root      20   0       0      0      0 R  89.4  0.0   5:09.55 kswapd0
 3027 root      20   0       0      0      0 R  60.9  0.0   0:44.53 kworker/u488:3
 3110 root      20   0       0      0      0 R  56.0  0.0   0:33.54 kworker/u488:0
 3026 root      20   0       0      0      0 R  54.4  0.0   1:22.48 kworker/u488:2
  375 root      20   0       0      0      0 R  50.6  0.0   2:50.14 kworker/u488:18

Note the 4 kworker threads doing filesystem or i/o work. With raid0, top looks like this:

top - 11:17:37 up  1:06,  4 users,  load average: 9.13, 3.95, 2.64
Tasks: 355 total,   6 running, 225 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.1 us, 25.4 sy,  0.0 ni, 52.5 id, 20.9 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 23252352 total,   399740 free,  1042156 used, 21810456 buff/cache
KiB Swap:  2097148 total,  2097148 free,        0 used. 21553788 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3027 root      20   0       0      0      0 R  99.2  0.0   6:19.05 kworker/u488:3
  158 root      20   0       0      0      0 S  85.0  0.0  11:33.52 kswapd0
 3487 root      20   0  745472  37580   1108 R  54.7  0.2   1:10.13 fio
 3481 root      20   0  745448  37568   1108 R  45.0  0.2   1:07.82 fio
 3485 root      20   0  745464  37572   1112 D  44.5  0.2   1:08.99 fio
 3483 root      20   0  745456  37572   1112 D  44.0  0.2   1:10.01 fio
 3486 root      20   0  745468  37572   1112 D  43.9  0.2   1:09.24 fio
 3484 root      20   0  745460  37564   1104 D  43.7  0.2   1:10.95 fio
 3488 root      20   0  745476  37584   1112 R  43.7  0.2   1:07.64 fio
 3482 root      20   0  745452  37572   1112 D  42.3  0.2   1:06.93 fio

Instead of 4 kworkers, there is now only 1. In the raid case it very much looks like all i/o is bottlenecking in a single kernel worker thread. Is this a known issue/design choice? Mostly I'm interested in knowing where the bottleneck is, rather than "fixing" it.

System is running Ubuntu 18.04 LTS / Linux 4.15.0-20-generic x86_64 on a 10-core Intel Xeon Silver 4114 2.2 GHz with 6-channel DDR4 and 4x Samsung 960 PRO 1TB NVME SSDs, each on a dedicated x4 PCIe 3 link from the CPU (no PCIe switches / PCH lanes).
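For reference, the two configurations I'm comparing look roughly like the sketch below. The exact commands and options are in the pastebin linked at the bottom; device names, mount points, and file sizes here are only placeholders.

  # Per-drive case: one ext4 filesystem per NVMe device (placeholder device names)
  for d in nvme0n1 nvme1n1 nvme2n1 nvme3n1; do
      mkfs.ext4 /dev/$d
      mkdir -p /mnt/$d
      mount -o noatime /dev/$d /mnt/$d
  done

  # raid0 case: stripe the same four devices with mdadm, then one ext4 on top
  mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
  mkfs.ext4 /dev/md0
  mount -o noatime /dev/md0 /mnt/md0

  # fio sequential-write load: 8 writer jobs, 8 separate files, libaio, iodepth 32
  # (for the per-drive case, the same job is pointed at the individual mounts instead)
  fio --name=seqwrite --rw=write --bs=1M --ioengine=libaio --iodepth=32 \
      --numjobs=8 --size=16g --directory=/mnt/md0 --group_reporting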
Full transcript of all mdadm, mkfs.ext4, fio, etc. commands and output here:
https://pastebin.com/AXuXhurD

thanks,
-Ryan